Editors’ Foreword 



The Seventh International Workshop on Database Programming Languages 
(DBPL99) took place in Kinloch Rannoch, Perthshire, UK from the 1st to the 
3rd of September 1999. This series of workshops focuses on the interaction of 
theory and practice in the design and development of database programming 
languages. The workshop has occurred biennially since 1987, and was previously 
held in: 

Roscoff, Finistere, France (1987) 

Salishan, Oregon, USA (1989) 

Nafplion, Argolida, Greece (1991) 

Manhattan, New York, USA (1993) 

Gubbio, Umbria, Italy (1995) 

Estes Park, Colorado, USA (1997) 

The workshop, as always, was organised as a mixture of invited speakers, 
informal paper presentations and discussion. Attendance at the workshop was 
limited to those who submitted papers and members of the Programme 
Committee, to ensure a sufficiently small forum for useful discussion. Before finding 
their way into this volume, papers were refereed by at least three members of 
the Programme Committee. Sixteen of the 31 submitted papers were accepted 
for presentation at the workshop. In the tradition of the series, authors were en- 
couraged to improve their papers based on both referees’ comments and ensuing 
discussion at the workshop, and resubmit them for publication in this volume, 
after which a further stage of refereeing took place. The result, we believe, is a 
volume of high-quality and well-polished papers. 

Two invited presentations were given, by Luca Cardelli (Microsoft Research 
Labs, Cambridge, UK) and Alon Levy (University of Washington). We are 
particularly grateful to Luca Cardelli for working his presentation into a full paper 
for inclusion in the volume, a task well beyond the call of duty! 

The sessions of the workshop were arranged under the following headings: 

Querying and query optimisation 
Languages for document models 
Persistence, components and workflow 
Typing and querying semi-structured data 
Active and spatial databases 

Unifying semi-structured and traditional data models 

It is interesting to note that the subject area of the workshop represents a 
significant departure from previous workshops. All of the papers are concerned with 
data-intensive computational systems. However, the numbers of papers, roughly 
arranged by category of interest, are as follows: 

Category                 Papers 
Active databases         1 
Interoperability         1 
Persistence              1 
Relational models        2 
Semi-structured data     8 
Spatial databases        2 
Workflow models          1 



This is a fairly typical spread of interest for a DBPL workshop, except for 
the sudden emergence of semi-structured data as a major theme. Databases, 
as defined in any textbook, deal with significantly large collections of highly 
structured data. However, it seems that the DBPL community has implicitly 
decided that semi-structured data, traditionally viewed as unstructured from a 
database perspective, is now a major theme within the database research domain. 

The workshop sessions contained the following papers: 



Invited talk: semi-structured computation 

In this paper Cardelli shows how his work on mobile ambient systems can be 
transferred to the domain of semi-structured data. The key observation is that 
both contexts are based upon imperfect knowledge of labeled graphs, and the 
paper gives an insight into a radically new model for computation over semi- 
structured data. 



Querying and query optimisation 

Libkin and Wong discuss conditions under which it is possible to evaluate certain 
database queries in the context of query languages that do not allow their explicit 
definition. This may be achieved by the incremental maintenance of the query 
result over changes to the data, rather than by a defined computation over the 
current given state. 

Aggelis and Cosmadakis show an optimisation method for nested SQL query 
blocks with aggregation operators, derived from the theory of dependency 
implication. In some cases this allows the merging of MAX, MIN blocks to allow 
the same optimisation strategy as tableau equivalence to be used. 

Grahne and Waller consider string databases, which they define as a col- 
lection of tables, the columns of which contain strings. They address the issue 
of designing a simple query language for string databases, based on a simple 
first-order logic extended by a concatenation operator. 



Languages for document models 

Maneth and Neven introduce a document transformation language, with similar 
expressive power to XSL, using regular expressions. A further language is in- 
troduced which replaces simple pattern matching by monadic second-order logic 
formulae. Various properties of this language are investigated. 



Neven contrasts document models defined using extended context-free gram- 
mars (in which the right-hand side of expansions may contain regular expres- 
sions) with standard context-free grammars. An important difference is the abil- 
ity to order child nodes. The investigation is into extensions of attribute gram- 
mars that may be usefully applied within the extended context. 



Persistence, components and workflow 

McIver et al. address the inherent problems of the application of the 
componentware paradigm in the context of databases. They introduce Souk, a 
language-independent paradigm for performing data integration, designed to allow the 
rapid construction of integrated solutions from off-the-shelf components. 

Printezis, Atkinson and Jordan investigate the pragmatic issue of the misuse 
of the transient keyword within the Java^ language. Originally intended to allow 
explicit closure severance within persistent versions of the language, it is now 
multiply interpreted by different implementations, allowed because of the loose 
definition of the language. The paper shows why most current interpretations 
are inappropriate and describes a more useful one for the context of a persistent 
Java system. 

Dong et al. show a method for translating distributed workflow schemata into 
a family of communicating flowcharts, which are essentially atomic and execute 
in parallel. Semantics-preserving transformations over these sets of flowcharts 
can be used to optimise the overall workflow according to the physical infras- 
tructure available for its execution. 



Typing and querying semi-structured data 

Bergholz and Freytag discuss the querying of semi-structured data. They propose 
that queries may be divided into two parts, the first part deriving a match 
between the data and a partial schema, the second part manipulating that part 
of the data that matches the schema. The first part of the query can be re-used 
for a number of different queries requiring the same structure. 

Buneman and Pierce investigate a new use of the unlabelled union type for 
typing semi-structured data. This overcomes the problems of the normal strategy 
of combining typed data sources in a semi-structured collection, which is to throw 
away all the existing type information. The union treatment shown allows type 
information, albeit in a weakened form, to be maintained without losing the 
inherent flexibility of the semi-structured format. 

Buneman, Fan and Weinstein concentrate on a restricted semi-structured 
data model, where outgoing edges are constrained to have unique labels. In this 
model, which is representative of a large body of semi-structured collections, 
many path constraint problems, undecidable in the general model, are decidable. 
The limits of these results are studied for some different classes of path constraint 
language. 

^ Java is a trademark of Sun Microsystems. 

Active and spatial databases 

Geerts and Kuijpers are interested in 2-dimensional spatial databases defined by 
polynomial inequalities, and in particular in the issue of topological connectivity. 
This is known not to be first-order expressible in general. They show a spatial 
Datalog program which tests topological connectivity for arbitrary closed and 
bounded spatial databases, and is guaranteed to terminate. 

Kuper and Su show extensions to linear constraint languages which can ex- 
press Euclidean distance. The operators under study work directly on the data, 
unlike previous work which depends upon the data representation. 

Bailey and Poulovassilis consider the termination of rules, which is a critical 
requirement for active databases. This paper shows an abstract interpretation 
framework which allows the modeling of specific approximations for termina- 
tion analysis methods. The framework allows the comparison and verification of 
different methods for termination analysis. 

Unifying semi-structured and traditional data models 

Grahne and Lakshmanan start from the observation that the state-of-the-art in 
semi-structured querying is based on navigational techniques, which are 
inherently detached from standard database theory. First, the semantics of querying is 
not entirely defined through the normal input/output typing of queries. Second, 
the notion of genericity is largely unaddressed within the domain, and indeed the 
emerging trend is for query expressions to be dependent on a particular instance 
of a database. 

Lahiri et al. investigate an integration of structured and semi-structured 
databases. They describe Ozone, a system within which structured data may 
contain references to semi-structured, and vice versa. The main contribution is 
towards the unification of representing and querying such hybrid data collections. 


Semistructured Computation 



Luca Cardelli 
Microsoft Research 



1 Introduction 

This paper is based on the observation that the areas of semistructured databases [1] 
and mobile computation [3] have some surprising similarities at the technical level. 
Both areas are inspired by the need to make better use of the Internet. Despite this com- 
mon motivation, the technical similarities that arise seem largely accidental, but they 
should still permit the transfer of some techniques between the two areas. Moreover, if 
we can take advantage of the similarities and generalize them, we may obtain a broader 
model of data and computation on the Internet. 

The ultimate source of similarities is the fact that both areas have to deal with ex- 
treme dynamicity of data and behavior. In semistructured databases, one cannot rely on 
uniformity of structure because data may come from heterogeneous and uncoordinated 
sources. Still, it is necessary to perform searches based on whatever uniformity one can 
find in the data. In mobile computation, one cannot rely on uniformity of structure be- 
cause agents, devices, and networks can dynamically connect, move around, become 
inaccessible, or crash. Still, it is necessary to perform computations based on whatever 
resources and connections one can find on the network. 

We will develop these similarities throughout the paper. As a sample, consider the 
following arguments. First, one can regard data structures stored inside network nodes 
as a natural extension of network structures, since on a large time/space scale both net- 
works and data are semistructured and dynamic. Therefore, one can think of applying 
the same navigational and code mobility techniques uniformly to networks and data. 
Second, since networks and their resources are semistructured, one can think of apply- 
ing semistructured database searches to network structure. This is a well-known major 
problem in mobile computation, going under the name of resource discovery. 

2 Information 

2.1 Representing Dynamic Information 

In our work on mobility [3, 5] we have been describing mobile structures in a variety 
of related ways. In all of these, the spatial part of the structure can be represented ab- 
stractly as an edge-labeled tree. 

For example, the following figure shows at the top left a nested-blob representation 
of geographical information. At the bottom left we have an equivalent representation in 
the nested-brackets syntax of the Ambient Calculus [5]. When hierarchical information 
is used to represent document structures, a more appropriate graphical representation is 
in terms of nested folders, as shown at the bottom right. Finally, at the top right we have 
a more schematic representation of hierarchies in terms of edge-labeled trees. 



R. Connor and A. Mendelzon (Eds.): DBPL’99, LNCS 1949, pp. 1-16, 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 

[Figure: four equivalent representations of the same hierarchy. Top left: 
geographical maps (nested blobs, outermost blob Earth). Top right: edge-labeled 
trees (rooted at Earth). Bottom left: expressions, 
Earth[US[...] | EU[UK[...] | ...] ...]. Bottom right: folders.] 

We have studied the Ambient Calculus as a general model of mobile computation. 
The Ambient Calculus has so far been restricted to edge-labeled trees, but it is not hard 
to imagine an extension (obtained by adding recursion) that can represent edge-labeled 
directed graphs. As it happens, edge-labeled directed graphs are also the favorite 
representation for semistructured data [1]. So, the basic data structures used to represent 
semistructured data and mobile computation essentially agree. Coincidence? 

It should be stressed that edge-labeled trees and graphs are a very rudimentary way 
of representing information. For example, there is no exact representation of record or 
variant data structures, which are at the foundations of almost all modern programming 
languages. Instead, we are thrown back to a crude representation similar to LISP’s 
S-expressions. 

The reason for this step backward, as we hinted earlier, is that in semistructured 
databases one cannot rely on a fixed number of subtrees for a given node (hence no 
records) and one cannot even rely on a fixed set of possible shapes under a node (hence 
no variants). Similarly, on a network, one cannot rely on a fixed number of machines 
being alive at a given node, or resources being available at a given site, nor can one rule 
out arbitrary network reconfiguration. So, the similarities in data representation arise 
from similarities of constraints on the data. 

In the rest of this section we discuss the representation of mobile and semistructured 
information. We emphasize the Ambient Calculus view of data representation, mostly 
because it is less well known. This model arose independently from semistructured 
data; it can be instructive to see a slightly different solution to what is essentially the same 
problem of dynamic data representation. 

2.2 Information Expressions and Information Trees 

We now describe in more detail the syntax of information expressions; this is a subset 
of the Ambient Calculus that concerns data structures. The syntax is interpreted as 
representing finite-depth edge-labeled unordered trees; for short: information trees. 

The tree that consists just of a root node is written as the expression 0: 

    0   represents   [figure: a single root node] 

A tree with a single edge labeled n from the root, leading to a subtree represented by 
P, is written as the expression n[P]: 

    n[P]   represents   [figure: a root with one edge labeled n leading to the tree P] 

A tree obtained by joining two trees, represented by P and Q, at the root, is written 
as the expression P | Q: 

    P | Q   represents   [figure: the trees P and Q sharing a single root] 

A tree obtained by joining an infinite number of equal trees, represented by P, at the 
root, is written as the expression !P. (This can be used to represent abstractly unbounded 
resources.) 

    !P   represents   [figure: infinitely many copies of P sharing a single root] 

The description of trees in this syntax is not unique. For example, the expressions 
P | Q and Q | P represent the same (unordered) tree; similarly, the expressions 0 | P and 
P represent the same tree. More subtle equivalences govern !. We will consider two 
expressions equivalent when they represent the same tree. 

The Ambient Calculus uses these tree structures to describe mobile computation, 
which is seen as the evolution of tree structures over time. The following figure gives, 
first, a blob representation of an agent moving from inside node a to inside node b, with 
an intermediate state where the agent is traveling over the network. Then, the same 
situation is represented as transformation of information trees, where hierarchy 
represents containment and the root is the whole network. Finally, the same situation 
is represented again as transformation of information expressions. The Ambient 
Calculus has additional syntax to represent the actions of the agent as it travels from a 
to b (indicated here by "..."); we will discuss these actions later. 

[Figure: an agent moving from a to b, shown as nested blobs, as information trees, and 
as information expressions:] 

    a[agent[...]] | b[0]  →  a[0] | agent[...] | b[0]  →  a[0] | b[agent[...]] 

Note that information trees are not restricted to be finite-branching. For example, 
the following information tree describes, in part, the city of Cambridge, the Cambridge 
Eagle pub, and within the pub two empty chairs and an unbounded number of full glass- 
es of beer. 




This tree can be represented by the following expression: 

Cambridge[Eagle[chair[0] | chair[0] | !glass[pint[0]]] | ...] 

Here is another example: an expression representing the (invalid!) fact that in 
Cambridge there is an unlimited number of empty parking spaces: 

Cambridge[!ParkingSpace[0] | ...] 

Equivalence of information trees can be characterized fairly easily, even in presence 
of infinite branching. Up to the equivalence relation induced by the following set of 
equations, two information expressions are equivalent if and only if they represent the 
same information tree [9]. Because of this, we will often confuse expressions with the 
trees they represent. 

    P | Q = Q | P                      !(P | Q) = !P | !Q 
    (P | Q) | R = P | (Q | R)          !0 = 0 
    P | 0 = P                          !P = P | !P 
                                       !P = !!P 

In contrast to our information trees, the standard model of semistructured data 
consists of finitely-branching edge-labeled unordered directed graphs. There is no notion 
of unbounded resource there, but there is a notion of node sharing that is not present in 
the Ambient Calculus. It should be interesting to try and combine the two models; it is 
not obvious how to do it, particularly in terms of syntactical representation. Moreover, 
the rules of equivalence of graph structures are more challenging; see Section 6.4 of [1]. 
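
As an illustration only (it is not part of the Ambient Calculus), the equations above 
can be checked on a small Python model of information trees. The sketch assumes a 
representation of our own choosing: a tree is a frozenset of (label, subtree, count) 
triples, where count is a positive integer or infinity for replicated edges. The names 
ZERO, edge, par and bang are hypothetical helpers, and no claim is made that this 
encoding captures every equivalence of the full calculus. 

```python
from math import inf

# Assumed model: an information tree is a frozenset of (label, subtree, count)
# triples; count is a positive int, or math.inf for an edge under "!".
ZERO = frozenset()


def edge(label, subtree=ZERO):
    """n[P]: a single edge labeled `label` leading to `subtree`."""
    return frozenset({(label, subtree, 1)})


def par(p, q):
    """P | Q: join two trees at the root, adding up edge multiplicities."""
    counts = {}
    for label, subtree, count in list(p) + list(q):
        key = (label, subtree)
        counts[key] = counts.get(key, 0) + count
    return frozenset((label, sub, n) for (label, sub), n in counts.items())


def bang(p):
    """!P: infinitely many copies of P joined at the root."""
    return frozenset((label, sub, inf) for label, sub, _ in p)


P = edge("chair")
Q = edge("glass", edge("pint"))

# Cambridge[Eagle[chair[0] | chair[0] | !glass[pint[0]]]]
cambridge = edge("Cambridge", edge("Eagle", par(par(P, P), bang(Q))))
```

In this encoding two expressions denote the same tree exactly when the resulting 
Python values compare equal, so each equation becomes an equality test such as 
bang(P) == par(P, bang(P)). 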

2.3 Ambient Operations 

The Ambient Calculus provides operations to describe the transformation of data. In the 
present context, the operations of the Ambient Calculus may look rather peculiar, be- 
cause they are intended to represent agent mobility rather than data manipulation. We 
present them here as an example of a set of operations on information trees; other sets 
of operations are conceivable. In any case, their generalization to directed graphs does 
not seem entirely obvious. 

Information expressions and information trees are a special case of ambient 
expressions and ambient trees; in the latter we can represent also the dynamic aspects of 
mobile computation and mutable information. An ambient tree is an information tree 
where each node in the tree may have an associated collection of concurrent threads that 
can execute certain operations. The fact that threads are associated to nodes means that 
the operations are “local”: they affect only a small number of nodes near the thread node 
(typically three nodes). In our example of an agent moving from a to b, there would usu- 
ally be a thread in the agent node (the node below the agent edge) that is the cause of 
the movement. 




Therefore, the full Ambient Calculus has both a spatial and a temporal component. 
The spatial component consists of information trees, that is, semistructured data. The 
temporal component includes operations that locally modify the spatial component. 
Rather than giving the syntax of these operations, we describe them schematically be- 
low. The location of the thread performing the operations is indicated by the thread 
icon. 

The operation in n causes an ambient to enter another ambient named n (i.e., it 
causes a subtree to slide down along an n edge). The converse operation, out n, causes an 
ambient to exit another ambient named n (i.e., it causes a subtree to slide up along an n 
edge). The operation open n opens up an ambient named n and merges its contents (i.e., 
it collapses an edge labeled n); these contents may include threads and subtrees. Finally, 
the spawning operation creates a new configuration within the current ambient (i.e., it 
creates a new tree and merges its root with the current node). 

It should be clear that, by strategically placing agents on a tree, we can rearrange, 
collapse, and expand sections of the tree at will. 
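
The tree rearrangements performed by in, out and open can be sketched in Python as 
follows. This is a simplified illustration under our own assumptions: threads are 
left out entirely, so the caller plays the role of the thread, and the names Node, 
child, take, move_in, move_out, open_ and show are hypothetical helpers rather than 
syntax of the Ambient Calculus. 

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Outgoing edges as (label, child) pairs; their order carries no meaning.
    edges: list = field(default_factory=list)


def child(node, label):
    """Return the first child reached by an edge with this label."""
    for edge_label, sub in node.edges:
        if edge_label == label:
            return sub
    raise LookupError(f"no edge labeled {label!r}")


def take(node, label):
    """Detach and return the first child reached by an edge with this label."""
    for i, (edge_label, sub) in enumerate(node.edges):
        if edge_label == label:
            del node.edges[i]
            return sub
    raise LookupError(f"no edge labeled {label!r}")


def move_in(node, m, n):
    """in n: m[...] | n[Q] becomes n[m[...] | Q] (m slides down along n)."""
    mover = take(node, m)
    try:
        target = child(node, n)
    except LookupError:
        node.edges.append((m, mover))  # undo the detachment before failing
        raise
    target.edges.append((m, mover))


def move_out(node, n, m):
    """out n: n[m[...] | Q] becomes m[...] | n[Q] (m slides up along n)."""
    mover = take(child(node, n), m)
    node.edges.append((m, mover))


def open_(node, n):
    """open n: collapse an edge labeled n, merging its contents into node."""
    opened = take(node, n)
    node.edges.extend(opened.edges)


def show(node):
    """Render a tree as an information expression, with edges sorted."""
    parts = sorted(f"{label}[{show(sub)}]" for label, sub in node.edges)
    return " | ".join(parts) or "0"


# The agent example of Section 2.2: the agent leaves a, then enters b.
net = Node([("a", Node([("agent", Node())])), ("b", Node())])
states = [show(net)]
move_out(net, "a", "agent")
states.append(show(net))
move_in(net, "agent", "b")
states.append(show(net))
```

Each operation touches only the current node and its immediate neighbours, which 
mirrors the locality of the operations described above. 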

2.4 Summary 

We have seen that there are some fundamental similarities of data representation in the 
areas of semistructured data and mobile computation. Moreover, in the case of mobile 
computation, we have ways of describing the manipulation of data. (In semistructured 
databases, data manipulation is part of the query language, which we discuss later.) 

3 Data Structures 

We discuss briefly how traditional data structures (records and variants) fit into the sem- 
istructured data and ambients data models. 

3.1 Records 

A record r is a structure of the form {l_1=v_1, ..., l_n=v_n}, where l_i are distinct 
labels and v_i are the associated values; the pairs l_i=v_i are called record fields. Field 
values can be extracted by a record selection operation, r.l_i, by indexing on the field 
labels. 

Semistructured data can naturally represent record-like structures: a root node 
represents the whole record, and for each field l_i=v_i, the root has an edge labeled l_i leading 
to a subtree v_i. Record fields are unordered, just like the edges of our trees. However, 
semistructured data does not correspond exactly to records: labels in a record are 
unique, while semistructured data can have any number of edges with the same label 
under a node. Moreover, records usually have uniform structure throughout a given 
collection of data, while there is no such uniformity on semistructured data. 
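
The mismatch can be made concrete with a small Python sketch. It assumes, purely for 
illustration, that a node is a list of (label, subtree) pairs; select is a 
hypothetical helper that behaves like record selection only when the label is 
unique under the node. 

```python
def select(node, label):
    """Record-style selection on a node given as (label, subtree) pairs.

    Succeeds only when exactly one outgoing edge carries the label, which is
    what a record guarantees and semistructured data does not.
    """
    matches = [sub for edge_label, sub in node if edge_label == label]
    if len(matches) != 1:
        raise LookupError(
            f"{len(matches)} edges labeled {label!r}; expected exactly 1"
        )
    return matches[0]


# A record-like node: every label occurs once.
pub = [("name", "Eagle"), ("city", "Cambridge")]

# A legal semistructured node that is no longer a record: "name" occurs twice.
crowded = pub + [("name", "The Eagle")]
```

Here select(pub, "city") returns "Cambridge", while select(crowded, "name") raises 
LookupError because the label is no longer unique. 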

It is interesting to compare this with the representation of records in the Ambient 
Calculus. There, we represent records {l_1=v_1, ..., l_n=v_n} as: 

    r[l_1[... v_1 ...] | ... | l_n[... v_n ...]] 

where r is the name (address) of the record, which is used to name an ambient r[ ... ] 
representing the whole record. This ambient contains subambients l_1[...] ... l_n[...] 
representing labeled fields (unordered because | is unordered). The field ambients contain the 
field values v_1, ..., v_n and some machinery (omitted here) to allow them to be read and 
rewritten. 

However, ambients represent mobile computation. This means that, potentially, 
field subambients can take off and leave, and new fields can arrive. Moreover, a 
new field can arrive that has the same label as an existing field. In both cases, the stable 
structure of ordinary records is destroyed. 

3.2 Variants 

A variant v is a structure of the form [l=v], where l is a label and v is the associated 
value, and where l is restricted to be a member of a finite set of labels l_1 ... l_n. A case 
analysis operation can be used to determine which of these labels is present in the variant, 
and to extract the associated value. 

A variant can be easily represented in semistructured data, as an edge labeled l 
leading to a subtree v, with the understanding that l is a unique edge of its parent node, and 
that l is a member of a finite collection l_1 ... l_n. But the latter restrictions are not enforced 
in semistructured data. A node meant to represent a variant could have zero outgoing 
edges, or two or more edges with different labels, or even two or more edges with the 
same label, or an edge whose label does not belong to the intended set. In all these sit- 
uations, the standard case analysis operation becomes meaningless. 
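
The failure modes just listed can be spelled out in a short Python sketch. As an 
assumption for illustration, a node is a list of (label, subtree) pairs, and case is 
a hypothetical helper standing in for the case analysis operation. 

```python
def case(node, allowed):
    """Case analysis on a node meant to represent a variant.

    Returns (label, subtree) when the node has exactly one outgoing edge whose
    label belongs to `allowed`; otherwise the analysis is meaningless and a
    ValueError is raised.
    """
    if len(node) != 1:
        raise ValueError(f"expected exactly one edge, found {len(node)}")
    label, sub = node[0]
    if label not in allowed:
        raise ValueError(f"label {label!r} is not in the intended set")
    return label, sub


SHAPES = {"circle", "square"}

good = [("circle", 2.0)]
ill_formed = [
    [],                                  # zero outgoing edges
    [("circle", 2.0), ("square", 1.0)],  # two edges with different labels
    [("circle", 2.0), ("circle", 3.0)],  # two edges with the same label
    [("triangle", 1.0)],                 # label outside the intended set
]
```

Only the node good passes the check; each entry of ill_formed is a legal 
semistructured node for which case raises ValueError. 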

A similar situation happens, again, in the case of mobile computation. Even if the 
constraints of variant structures are respected at a given time, a variant may decide to 
leave its parent node at some point, or other variants may come to join the parent node. 

3.3 Summary 

We have seen that fundamental data structures used in programming languages 
become essentially meaningless both in semistructured data and in mobile computation. 
We have discussed the untyped situation here, but this means in particular that funda- 
mental notions of types in programming languages become inapplicable. We discuss 
type systems next. 

4 Type Systems 

4.1 Type Systems for Dynamic Data 

Because of the problems discussed in the previous section, it is quite challenging to de- 
vise type systems for semistructured data or mobile computation. Type systems track 
invariants in the data, but most familiar invariants are now violated. Therefore, we need 
to find weaker invariants and weaker type systems that can track them. 

In the area of semistructured data, ordinary database schemas are too rigid, for the 
same reasons that ordinary type systems are too rigid. New approaches are needed; for 
example, union types have been proposed [2]. Here we give the outline of a different 
solution devised for mobile computation. Our task is to find a type system for the infor- 
mation trees of Section 2, subject to the constraint that information trees can change dy- 
namically, and that the operations that change them must be typeable too. 

4.2 A Type System for Information Trees 

The type system we present here may appear to be very weak, in the sense of imposing 
very few constraints on information trees. However, this appearance is deceptive: 
within this type system, when applied to the full Ambient Calculus, we can represent 
standard type systems for the λ-calculus and the π-calculus [6]. Moreover, more refined 
type systems for mobility studied in [4] enforce more constraints by forcing certain sub- 
structures to remain “immobile”. Here we give only an intuitive sketch of the type sys- 
tem; details can be found in [6]. 

The task of finding a type system for information trees is essentially the same as 
the task of finding a type system for ordinary hierarchical file systems. Imagine a file 
system with the following constraints. First, each folder has a name. Second, each name 
has an associated data type (globally). Third, each folder of a given name can contain 
only data of the type associated with its name. Fourth, if there is a thread operating at a 
node, it can only read and write data of the correct type at that node. Fifth, any folder 
can contain any other kind of folder (no restrictions). 

In terms of information trees, these rules can be depicted as follows. Here we add 
the possibility that the nodes of an information tree may contain atomic data (although in 
principle this data can also be represented by trees): 

[Figure: an information tree annotated as follows: all n edges have type T; only 
atomic data of type T at this node (the node below an n edge); arbitrary subtree.] 

Next, we need to examine the operations described in section 2.3 (or any similar set 
of operations) to make sure they can be typed. The type system can easily keep track of 
the global associations of types to names. Moreover, we need to type each thread ac- 
cording to the type of data it can read, write, or merge (by performing open) at the cur- 
rent node. 

The in and out operations change the structure of the tree (which is not restricted by 
the type system) but do not change the relationship between an edge and the contents of 
the node below it; so no type invariant is violated. The open operation, though, merges 
the contents of two nodes. Here the type system must guarantee that the labels above 
those two nodes have the same type; this can be done relatively easily, by keeping track 
of the type of each thread, as sketched above. Finally, the spawn operation creates a new 
subtree, so it must simply enforce the relationship between the edges it creates and the 
attached data. 

This is a sensible type system in the sense that it guarantees well-typed interactions: 
any process that reads or writes data at a particular node (i.e., inside a particular folder) 
can rely on the kind of data it will find there. On the other hand, this type system does 
not constrain the structure of the tree, therefore allowing both heterogeneity (for semi- 
structured data) and mutability (for mobile computation). 

Note also that this type system does not give us anything similar to ordinary record 
types. Folder types are both weaker than record types, because they do not enforce uni- 
formity of substructures, and stronger, because they enforce global constraints on the 
typing of edges. 
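
The folder discipline can be sketched as a small Python check. This is an 
illustration under our own assumptions, not the type system of [6]: Python classes 
stand in for data types, the root folder is given the hypothetical name "root", 
and Folder, well_typed and can_open are names introduced here. 

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    data: list = field(default_factory=list)      # atomic data stored at this node
    children: list = field(default_factory=list)  # (name, Folder) pairs, unrestricted


def well_typed(name, folder, types):
    """Check that every folder holds only data of the type tied to its name.

    `types` is the global association of names to data types. The shape of the
    tree is deliberately not constrained: any folder may contain any folder.
    """
    expected = types[name]
    if not all(isinstance(item, expected) for item in folder.data):
        return False
    return all(well_typed(n, sub, types) for n, sub in folder.children)


def can_open(outer, inner, types):
    """open merges the contents of `inner` into `outer`: the types must agree."""
    return types[outer] == types[inner]


TYPES = {"root": str, "notes": str, "memo": str, "counts": int}

tree = Folder(children=[
    ("notes", Folder(data=["hello"], children=[("counts", Folder(data=[1, 2]))])),
    ("counts", Folder(data=[3], children=[("notes", Folder())])),
])

broken = Folder(children=[("counts", Folder(data=["oops"]))])
```

The tree above is heterogeneous (notes inside counts and counts inside notes) yet 
well typed, whereas broken is rejected because a counts folder holds a string. 
Moving folders around, as in and out do, cannot change the outcome of the check, 
since it depends only on each folder's name and its own data. 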
4.3 Summary 

Because of the extreme dynamicity present both in semistructured data and in mobile 
computation, new type systems are needed. We have presented a particular type system 
as an example of possible technology transfers: we have several ready-made type sys- 
tems for mobile computation that could be applicable to semistructured data. 

5 Queries 

Semistructured databases have developed flexible ways of querying data, even though 
the data is not rigidly structured according to schemas [1]. In relational database theory, 
query languages are nicely related to query algebras and to query logics. However, 
query algebras and query logics for semistructured databases are not yet well understood. 

For reasons unrelated to queries, we have developed a specification logic for the 
Ambient Calculus [7]. Could this logic, by an accident of fate, lead to a query language 
for semistructured data? 

5.1 Ambient Logic 

In classical logic, assertions are simply either true or false. In modal logic, instead, as- 
sertions are true or false relative to a state (or world). For example, in epistemic logic 
assertions are relative to the knowledge state of an entity. In temporal logic, assertions 
are relative to the execution state of a program. In our Ambient Logic, which is a modal 
logic, assertions are relative to the current place and the current time. 

As an example, here is a formula in our logic that makes an assertion about the shape 
of the current location at the current time. It is asserting that right now, right here, there 
is a location called Cambridge that contains at least a location called Eagle that contains 
at least one empty chair (the formula 0 matches an empty location; the formula T 
matches anything): 

Cambridge[Eagle[chair[0] | T] | T] 

This assertion happens to be true of the tree shown in Section 2.2. However, the truth 
of the assertion will in general depend on the current time (is it happy hour, when all 
chairs are taken?) and the current location (Cambridge England or Cambridge Mass.?). 

More generally, our logic includes both assertions about trees, such as the one 
above, and standard logical connectives for composing assertions. The following table 
summarizes the formulas of the Ambient Logic. The first three lines give classical 
propositional logic. The next three lines describe trees. Then we have two modal 
connectives for assertions that are true somewhere or sometime. After the two adjunctions 
(discussed later) we have quantification over names, giving us a form of predicate logic; 
the quantified names can appear in the location and location adjunct constructs. 

Formulas of the Ambient Logic 

    η                a name n or a variable x 

    A, B : Φ ::= 
        T            true 
        ¬A           negation 
        A ∨ B        disjunction 
        0            void 
        η[A]         location 
        A | B        composition 
        ✧A           somewhere modality 
        ◇A           sometime modality 
        A@η          location adjunct 
        A ▷ B        composition adjunct 
        ∀x.A         universal quantification over names 

5.2 Satisfaction 

The exact meaning of logical formulas is given by a satisfaction relation connecting a
tree with a formula. The term satisfaction comes from logic; for reasons that will
become apparent shortly, we will also call this concept matching. The basic question we
consider: is this formula satisfied by this tree? Or: does this tree match this formula?

The satisfaction relation between a tree $P$ (actually, an expression $P$ representing a
tree) and a formula $\mathcal{A}$ is written:

$$P \models \mathcal{A}$$

For the basic assertions on trees, the satisfaction/matching relation can be described
as follows, relating tree shapes (written here as tree expressions) to formulas:

• $0$: here now there is absolutely nothing:

the empty tree matches $0$

• $n[\mathcal{A}]$: here now there is one edge called $n$, whose descendant satisfies the
formula $\mathcal{A}$:

$n[P]$ matches $n[\mathcal{A}]$ if $P$ matches $\mathcal{A}$.

• $\mathcal{A} \mid \mathcal{B}$: here now there are exactly two things next to each other, one
satisfying $\mathcal{A}$ and one satisfying $\mathcal{B}$:

$P \mid Q$ matches $\mathcal{A} \mid \mathcal{B}$ if $P$ matches $\mathcal{A}$ and $Q$ matches $\mathcal{B}$
(or if $P$ matches $\mathcal{B}$ and $Q$ matches $\mathcal{A}$)

• $\blacklozenge\mathcal{A}$: somewhere now, there is a place satisfying $\mathcal{A}$:

a tree containing $P$ matches $\blacklozenge\mathcal{A}$ if $P$ matches $\mathcal{A}$ (i.e., there must be a
subtree $P$ that matches $\mathcal{A}$)

• $\Diamond\mathcal{A}$: here sometime, there is a thing satisfying $\mathcal{A}$, after some reductions:

$P$ matches $\Diamond\mathcal{A}$ if $P \to^{*} P'$ and $P'$ matches $\mathcal{A}$

The propositional connectives and the universal quantifier have fairly standard
interpretations. A formula $\neg\mathcal{A}$ is satisfied by anything that does not satisfy $\mathcal{A}$.
A formula $\mathcal{A} \vee \mathcal{B}$ is satisfied by anything that satisfies either $\mathcal{A}$ or $\mathcal{B}$.
Anything satisfies the formula $\mathbf{T}$, while nothing satisfies its negation, $\mathbf{F}$, defined
as $\neg\mathbf{T}$. A formula $\forall x.\mathcal{A}$ is satisfied by a tree $P$ if for all names $n$, the tree
$P$ satisfies $\mathcal{A}$ where $x$ is replaced by $n$.
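
The static part of the satisfaction relation (everything except the sometime modality, the adjunctions and the quantifier) can be sketched as a small recursive checker. The Python below is only an illustration under our own assumptions: the tuple encoding of trees and formulas, and the names `tree`, `splits`, `subtrees` and `sat`, are not part of the logic, and the sample tree is a made-up stand-in for the one in Section 2.2.

```python
from itertools import combinations

# Assumed encoding: a tree is a sorted tuple of (name, subtree) edges; () is 0.
def tree(*edges):
    return tuple(sorted(edges))

def splits(t):
    """Yield every way to split the top-level edges of t into two parts."""
    idx = range(len(t))
    for r in range(len(t) + 1):
        for left in combinations(idx, r):
            yield (tuple(t[i] for i in left),
                   tuple(t[i] for i in idx if i not in left))

def subtrees(t):
    """Yield t itself and the contents of every location below it."""
    yield t
    for _, child in t:
        yield from subtrees(child)

def sat(t, f):
    """Decide whether tree t satisfies formula f (static fragment only)."""
    op = f[0]
    if op == "T":
        return True
    if op == "not":
        return not sat(t, f[1])
    if op == "or":
        return sat(t, f[1]) or sat(t, f[2])
    if op == "0":
        return t == ()
    if op == "loc":        # n[A]: exactly one edge n whose contents satisfy A
        return len(t) == 1 and t[0][0] == f[1] and sat(t[0][1], f[2])
    if op == "par":        # A | B: some split of the root satisfies A and B
        return any(sat(p, f[1]) and sat(q, f[2]) for p, q in splits(t))
    if op == "somewhere":  # some subtree satisfies A
        return any(sat(s, f[1]) for s in subtrees(t))
    raise ValueError(f"unsupported connective: {op}")

T, ZERO = ("T",), ("0",)
def loc(n, a): return ("loc", n, a)
def par(a, b): return ("par", a, b)

# Hypothetical tree: Cambridge[Eagle[chair[John[0]] | chair[Mary[0]] | chair[0]]]
pub = tree(("Cambridge", tree(("Eagle", tree(
    ("chair", tree(("John", ()))),
    ("chair", tree(("Mary", ()))),
    ("chair", ()))))))

# Cambridge[Eagle[chair[0] | T] | T]
formula = loc("Cambridge", par(loc("Eagle", par(loc("chair", ZERO), T)), T))
print(sat(pub, formula))
```

The composition case enumerates all splits of the root, so the sketch is exponential in the width of the tree; it is meant to make the definitions concrete, not to be efficient.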

Many useful derived connectives can be defined from the primitive ones. Here is a 
brief list: 

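• Normal Implication: $\mathcal{A} \Rightarrow \mathcal{B} \triangleq \neg\mathcal{A} \vee \mathcal{B}$. This is the standard definition, but note
that in our modal logic this means that $P$ matches $\mathcal{A} \Rightarrow \mathcal{B}$ if whenever $P$ matches $\mathcal{A}$
then the same $P$ matches $\mathcal{B}$ at the same time and in the same place. As examples,
consider $\mathit{Borders}[\mathbf{T}] \Rightarrow \mathit{Borders}[\mathit{Starbucks}[\mathbf{T}] \mid \mathbf{T}]$, stating that a Borders bookstore
contains a Starbucks shop, and
$(\mathit{NonSmoker}[\mathbf{T}] \mid \mathbf{T}) \Rightarrow (\mathit{NonSmoker}[\mathbf{T}] \mid \mathit{Smoker}[\mathbf{T}] \mid \mathbf{T})$,
stating that next to a non-smoker there is a smoker.

• Everywhere: $\blacksquare\mathcal{A} \triangleq \neg\blacklozenge\neg\mathcal{A}$. What is true everywhere? Not much, unless we
qualify it. We can write $\blacksquare(\mathcal{A} \Rightarrow \mathcal{B})$ to mean that everywhere $\mathcal{A}$ is true, $\mathcal{B}$ is true as
well. For example, $\mathit{US}[\blacksquare(\mathit{Borders}[\mathbf{T}] \Rightarrow \mathit{Borders}[\mathit{Starbucks}[\mathbf{T}] \mid \mathbf{T}])]$.

• Always: $\Box\mathcal{A} \triangleq \neg\Diamond\neg\mathcal{A}$. This can be used to express temporal invariants, such as:
$\Box\,\mathit{Pisa}[\mathit{LeaningTower}[\mathbf{T}] \mid \mathbf{T}]$.

• Parallel Implication: $\mathcal{A} \mid\Rightarrow \mathcal{B} \triangleq \neg(\mathcal{A} \mid \neg\mathcal{B})$. This means that it is not possible to
split the root of the current tree in such a way that one part satisfies $\mathcal{A}$ and the other
does not satisfy $\mathcal{B}$. In other words, every way we split the root of the current tree, if
one part satisfies $\mathcal{A}$, then the other part must satisfy $\mathcal{B}$. For example,
$\mathit{Bath}[\blacksquare(\mathit{NonSmoker}[\mathbf{T}] \mid\Rightarrow \mathit{Smoker}[\mathbf{T}] \mid \mathbf{T})]$ means that at the Bath pub, anywhere
there is a non-smoker there is, nearby, a smoker. Note that parallel implication makes the
definition of this property a bit more compact than in the earlier example about smokers.

• Nested Implication: $n[{\Rightarrow}\mathcal{A}] \triangleq \neg n[\neg\mathcal{A}]$. This means that it is not possible that the
contents of an $n$ location do not satisfy $\mathcal{A}$. In other words, if there is an $n$ location,
its contents satisfy $\mathcal{A}$. For example: $\mathit{US}[\blacksquare\,\mathit{Borders}[{\Rightarrow}\,\mathit{Starbucks}[\mathbf{T}] \mid \mathbf{T}]]$; again, this
is a bit more compact than the previous formulation of this example.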

5.3 Adjunctions 

The adjunction connectives, $\mathcal{A} \triangleright \mathcal{B}$ and $\mathcal{A}@n$, are of special interest; they are the logical
inverses, in a certain sense, of $\mathcal{A} \mid \mathcal{B}$ and $n[\mathcal{A}]$ respectively. In ordinary logic, we have
a fundamental adjunction between conjunction and implication given by the property:
$\mathcal{A} \wedge \mathcal{B}$ entails $\mathcal{C}$ iff $\mathcal{A}$ entails $\mathcal{B} \Rightarrow \mathcal{C}$. Similarly, in our logic we have that
$\mathcal{A} \mid \mathcal{B}$ entails $\mathcal{C}$ iff $\mathcal{A}$ entails $\mathcal{B} \triangleright \mathcal{C}$, and that $n[\mathcal{A}]$ entails $\mathcal{C}$ iff $\mathcal{A}$ entails
$\mathcal{C}@n$. We now explore the explicit meaning of these adjunctions.

The formula $\mathcal{A} \triangleright \mathcal{B}$ means that the tree present here and now satisfies the formula $\mathcal{B}$
when it is merged at the root with any tree that satisfies the formula $\mathcal{A}$. We can think of
this formula as a requirement/guarantee specification: given any context that satisfies
$\mathcal{A}$, the combination of that context with the current tree will satisfy $\mathcal{B}$.

$P$ matches $\mathcal{A} \triangleright \mathcal{B}$ if for all $Q$ that match $\mathcal{A}$ we have that $P \mid Q$ matches $\mathcal{B}$

For example, consider a representation of a fish consisting of a certain structure (begin- 
ning with fish[...]), and a certain behavior. A prudent fish would satisfy the following 
specification, stating that even in presence of bait, the bait and the fish remain separate: 

$\mathit{fish}[\ldots] \models \mathit{bait}[\mathbf{T}] \triangleright \Box(\mathit{fish}[\mathbf{T}] \mid \mathit{bait}[\mathbf{T}])$

On the other hand, a good bait would satisfy the following specification, stating that in 
presence of a fish, it is possible that the fish will eventually ingest the bait: 

$\mathit{bait}[\ldots] \models \mathit{fish}[\mathbf{T}] \triangleright \Diamond\mathit{fish}[\mathit{bait}[\mathbf{T}] \mid \mathbf{T}]$



These two specifications are, of course, incompatible. In fact, it is possible to show
within our logic that, independently of any implementation of fish and bait, the
composition of the fish spec with the bait spec leads to a logical contradiction.

The formula $\mathcal{C}@n$ means that the tree present here and now satisfies the formula $\mathcal{C}$
when it is placed under an edge named $n$. This is another kind of requirement/guarantee
specification, regarding nested contexts instead of parallel contexts: even when
"thrown" inside an $n$ context, the current tree will manage to satisfy the property $\mathcal{C}$.

$P$ matches $\mathcal{C}@n$ if $n[P]$ matches $\mathcal{C}$

For example, an aquarium fish should satisfy the following property, stating that the fish 
will survive when placed in a (persistent) tank: 



$(\Box\,\mathit{tank}[\mathit{fish}[\mathbf{T}] \mid \mathbf{T}])\,@\,\mathit{tank}$



5.4 From Satisfaction to Queries 

A satisfaction relation, such as the one defined in the previous section, is not always
decidable. However, in our case, if we rule out the $!P$ operator on trees, which describes
infinite configurations, and also the $\mathcal{A} \triangleright \mathcal{B}$ formulas, which involve a quantification over
an infinite set of trees, then the problem of whether $P \models \mathcal{A}$ becomes decidable [7]. A
decision procedure for such a problem is also called a modelchecking algorithm. Such
an algorithm implements essentially a matching procedure between a tree and a formula,
where the result of the match is just success or failure.

For example, the following match succeeds. The formula can be read as stating that 
there is an empty chair at the Eagle pub; the matching process verifies that this fact 
holds in the current situation: 

$\mathit{Eagle}[\mathit{chair}[\mathit{John}[0]] \mid \mathit{chair}[\mathit{Mary}[0]] \mid \mathit{chair}[0]]$

$\models \mathit{Eagle}[\mathit{chair}[0] \mid \mathbf{T}]$

More generally, we can conceive of collecting information during the matching pro- 
cess about which parts of the tree match which parts of the formula. Further, we can en- 
rich formulas with markers that are meant to be bound to parts of the tree during 
matching; the result of the matching algorithm is then either failure or an association of 
formula markers to the trees that matched them. 

We thus extend formulas with matching variables, $\mathcal{X}$, which are often placed where
previously we would have placed a $\mathbf{T}$. For example by matching:

$\mathit{Eagle}[\mathit{chair}[\mathit{John}[0]] \mid \mathit{chair}[\mathit{Mary}[0]] \mid \mathit{chair}[0]]$

$\models \mathit{Eagle}[\mathit{chair}[\mathcal{X}] \mid \mathbf{T}]$

we obtain, bound to $\mathcal{X}$, either somebody sitting at the Eagle, or the indication that there
is an empty chair. Moreover, by matching:

$\mathit{Eagle}[\mathit{chair}[\mathit{John}[0]] \mid \mathit{chair}[\mathit{Mary}[0]] \mid \mathit{chair}[0]]$

$\models \mathit{Eagle}[\mathit{chair}[(\neg 0) \wedge \mathcal{X}] \mid \mathbf{T}]$

we obtain, bound to $\mathcal{X}$, somebody (not $0$) sitting at the Eagle. Here the answer could be
either $\mathit{John}[0]$ or $\mathit{Mary}[0]$, since both lead to a successful global match. Moreover, by
using the same variable more than once we can express constraints: the formula
$\mathit{Eagle}[\mathit{chair}[(\neg 0) \wedge \mathcal{X}] \mid \mathit{chair}[\mathcal{X}] \mid \mathbf{T}]$ is successfully matched if there are two
people with the same name sitting at the Eagle.

These generalized formulas that include matching variables can thus be seen as que- 
ries. The result of a successful matching can be seen as a possible answer to a query, 
and the collection of all possible successful matches as the collection of all answers. 
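
To make the step from matching to querying concrete, the following Python sketch extends a matcher so that it returns variable bindings instead of a yes/no answer. It is an illustration under our own assumptions, not the algorithm of [7]: the tuple encoding, the names `tree`, `splits` and `match`, and the decision to let negation succeed without producing bindings are all choices made for this example. Only the connectives needed for the Eagle queries are covered.

```python
from itertools import combinations

# Assumed encoding: a tree is a sorted tuple of (name, subtree) edges; () is 0.
def tree(*edges):
    return tuple(sorted(edges))

def splits(t):
    """Yield every way to split the top-level edges of t into two parts."""
    idx = range(len(t))
    for r in range(len(t) + 1):
        for left in combinations(idx, r):
            yield (tuple(t[i] for i in left),
                   tuple(t[i] for i in idx if i not in left))

def match(t, f, env):
    """Yield every extension of env under which tree t matches formula f."""
    op = f[0]
    if op == "T":
        yield env
    elif op == "0":
        if t == ():
            yield env
    elif op == "var":          # matching variable: bind it, or check the binding
        name = f[1]
        if name not in env:
            yield {**env, name: t}
        elif env[name] == t:
            yield env
    elif op == "not":          # used here on variable-free formulas only
        if next(match(t, f[1], env), None) is None:
            yield env
    elif op == "and":
        for e1 in match(t, f[1], env):
            yield from match(t, f[2], e1)
    elif op == "loc":
        if len(t) == 1 and t[0][0] == f[1]:
            yield from match(t[0][1], f[2], env)
    elif op == "par":
        for p, q in splits(t):
            for e1 in match(p, f[1], env):
                yield from match(q, f[2], e1)
    else:
        raise ValueError(f"unsupported connective: {op}")

T, ZERO, X = ("T",), ("0",), ("var", "X")
def loc(n, a): return ("loc", n, a)
def par(a, b): return ("par", a, b)
occupied = ("and", ("not", ZERO), X)           # (not 0) and X

eagle = tree(("Eagle", tree(("chair", tree(("John", ()))),
                            ("chair", tree(("Mary", ()))),
                            ("chair", ()))))

# Eagle[chair[(not 0) and X] | T]: who is sitting at the Eagle?
who = loc("Eagle", par(loc("chair", occupied), T))
answers = [env["X"] for env in match(eagle, who, {})]
print(answers)

# Eagle[chair[(not 0) and X] | chair[X] | T]: two people with the same name?
twins = loc("Eagle", par(loc("chair", occupied), par(loc("chair", X), T)))
print(list(match(eagle, twins, {})))
```

Each yielded environment is one answer to the query; collecting all of them gives the collection of all answers described above.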

For serious semistructured database applications, we need also sophisticated ways 
of matching names (e.g. with wildcards and lexicographic orders) and of matching 
paths of names. For the latter, though, we already have considerable flexibility within 
the existing logic; consider the following examples: 

• Exact path. The formula $n[m[p[\mathcal{X}]] \mid \mathbf{T}]$ means: match a path consisting of the names
$n$, $m$, $p$, and bind $\mathcal{X}$ to what the path leads to. Note that, in this example, other paths
may lead out of $n$, but there must be a unique path out of $m$ and $p$.

• Dislocated path. The formula $n[\blacklozenge(m[\mathcal{X}] \mid \mathbf{T})]$ means: match a path consisting of a
name $n$, followed by an arbitrary path, followed by a name $m$; bind $\mathcal{X}$ to what the
path leads to.

• Disjunctive path. The formula $n[p[\mathcal{X}]] \vee m[p[\mathcal{X}]]$ means: bind $\mathcal{X}$ to the result of
following either a path $n$, $p$, or a path $m$, $p$.

• Negative path. The formula $\blacklozenge m[\neg(p[\mathbf{T}] \mid \mathbf{T}) \mid q[\mathcal{X}]]$ means: bind $\mathcal{X}$ to anything
found somewhere under $m$, inside a $q$ but not next to a $p$.

• Wildcard and restricted wildcard. $m[\exists y.\, y \neq n \wedge y[\mathcal{X}]]$ means: match a path consisting
of $m$ and any name different from $n$, and bind $\mathcal{X}$ to what the path leads to. (Inequality
of names can be expressed within the logic [7]).

5.5 Adjunctive Queries 

Using adjunctions, we can express queries that not only produce matches, but also
reconstruct results.

Consider the query: 

$m[\mathcal{X}@n]$

This is matched by a tree $m[P]$ if $P$ matches $\mathcal{X}@n$. By definition of $P$ matching $\mathcal{X}@n$,
we must verify that $n[P]$ matches $\mathcal{X}$. The latter simply causes the binding of $\mathcal{X}$ to $n[P]$,
and we have this association as the result of the query. Note that $n[P]$ is not a subtree of
the original tree: it was constructed by the query process. A similar query,
$\blacklozenge m[\mathcal{X}@q@n]$, means: if somewhere there is an edge $m$, wrap its contents $P$ into
$q[n[P]]$, and return that as the binding for $\mathcal{X}$.

Consider now the query 

$n[0] \triangleright \mathcal{X}$

We have that $P$ matches $n[0] \triangleright \mathcal{X}$ if for all $Q$ that match $n[0]$, $P \mid Q$ matches $\mathcal{X}$. This
immediately gives a result binding of $P \mid Q$ for $\mathcal{X}$. But what is $Q$? Fortunately there is only
one $Q$ that matches the formula $n[0]$, and that is the tree $n[0]$. So, this query has the
following meaning: compose the current tree with $n[0]$, and give that as the binding of $\mathcal{X}$.
Note, again, that this composition is not present in the original tree: it is constructed by
the query. In this particular case, the infinite quantification over all $Q$ does not hurt.
However, as we mentioned above, we do not have a general matching algorithm for $\triangleright$,
so we can at best handle some special cases.

It is not clear yet how much expressive power is induced by adjunctive queries, but 
the idea of using adjunctions to express query-and-recombination seems interesting, 
and it comes naturally out of an existing logic. It should be noted that basic questions 
of expressive power for semistructured database query languages are still open. 

In other work [8], we are using a more traditional SQL-style select construct for con- 
structing answers to queries. The resulting query language seems to be very similar to 
XML-QL [1], perhaps indicating a natural convergence of query mechanisms. Howev- 
er, it is also clear that new and potentially useful concepts, such as adjunctive queries, 
are emerging from the logical point of view. 

5.6 Summary 

We have seen that what was originally intended as a specification logic for mobile sys- 
tems can be interpreted (with some extension) as a powerful query language for semis- 
tructured data. Conversely, although we have not discussed this, well-known efficient 
techniques for computing queries in databases can be used for modelchecking certain 
classes of mobile specifications. 
6 Update

Sometimes we wish to change the data. These changes can be expressed by computa- 
tional processes outside of the domain of databases and query languages. For example, 
we can use the Ambient Calculus operations described in Section 2.3 to transform trees. 
In general, if we have a fully worked-out notion of semistructured computation, instead 
of just semistructured data, then we already have a notion of semistructured update. 

In database domains, however, we may want to be able to express data transformations
more declaratively. For example, transformation systems based on tree grammar
transducers have been proposed for XML. It turns out that in our Ambient Logic we also
have ways of specifying update operations declaratively, as we now discuss.

6.1 From Satisfiability to Update 

In the examples of queries given so far we have considered only a static notion of 
matching. Remember, though, that we also have a temporal operator in the logic, $\Diamond\mathcal{A}$,
that requires matching $\mathcal{A}$ after some evolution of the underlying tree. If we want to talk
about update, we need to say that right now, we have a certain configuration, and later, 
we achieve another configuration. 

To this end, we consider a slightly different view of the satisfaction problem. So far
we have considered questions of the form $P \models \mathcal{A}$ when both $P$ and $\mathcal{A}$ are given. Consider
now the case where only $\mathcal{A}$ is given, and where we are looking for a tree that satisfies
it; we can write this problem as $X \models \mathcal{A}$. In some cases this is easy: any formula
constructed only by composing $0$, $n[\mathcal{A}]$, and $\mathcal{A} \mid \mathcal{B}$ operations is satisfied by a unique
tree. If other logical operators are used, the problem becomes harder (possibly undecidable).

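Consider, then, the problem $X \models \mathcal{A} \triangleright \Diamond\mathcal{B}$. By definition, we have that $X$ matches
$\mathcal{A} \triangleright \Diamond\mathcal{B}$ if when composed with any tree $P$ that matches $\mathcal{A}$, the composition $P \mid X$ can
evolve into a tree that satisfies $\mathcal{B}$. Therefore, whatever $X$ is, it must be something that
transforms a tree satisfying $\mathcal{A}$ into a tree satisfying $\mathcal{B}$. In other words, $X$ is a mutator of
arbitrary $\mathcal{A}$ trees into $\mathcal{B}$ trees, and $X \models \mathcal{A} \triangleright \Diamond\mathcal{B}$ is a specification of such a mutator.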

So, we can see $X \models \mathcal{A} \triangleright \Diamond\mathcal{B}$ as an inference problem where we are trying to
synthesize an appropriate mutator. We believe that this is very much in the database style,
where transformations are often specified declaratively, and synthesized by sophisticated
optimizers. Of course, this problem can be hard. Alternatively, if we have a proposed
mutator $P$ to transform $\mathcal{A}$ trees into $\mathcal{B}$ trees, we can try to verify the property
$P \models \mathcal{A} \triangleright \Diamond\mathcal{B}$, to check the correctness of the mutator.

6.2 Summary 

We have seen that query languages for semistructured data and specification logics for
mobility can be related. In one direction, this can give us new query languages for
semistructured data, or at least a new way of looking at existing query languages. In the
other direction, this can give us modelchecking techniques for mobility specifications.




16 Luca Cardelli 



Conclusions 

In conclusion, we have argued that semistructured data and mobile computation are nat- 
urally related, because of a hidden similarity in the problems they are trying to solve. 

From our point of view, we have discovered that the Ambient Calculus can be seen 
as a computational model over semistructured data. As a consequence, type systems al- 
ready developed for the Ambient Calculus can be seen as weak schemas for semistruc- 
tured data. Moreover, the Ambient Logic, with some modifications, can be seen as a 
query language for semistructured data. 

We have also discovered that it should be interesting to integrate ideas and techniques
arising from semistructured databases into the Ambient Calculus, and in mobile
computation in general. Examples include the generalization of the Ambient Calculus to
graph structures, the use of database techniques for modelchecking, and the use of
semistructured query languages for network resource discovery.

We hope that, conversely, people in the semistructured database community will 
find this connection interesting, and will be able to use it for their own purposes. Much, 
of course, remains to be done. 
Abstract. We consider $\mathrm{IES}(\mathcal{SQL})$, the incremental evaluation system
over an SQL-like language with grouping, arithmetics, and aggregation.
We show that every second order query is in $\mathrm{IES}(\mathcal{SQL})$ and that there
are PSPACE-complete queries in $\mathrm{IES}(\mathcal{SQL})$. We further show that every
PSPACE query is in $\mathrm{IES}(\mathcal{SQL})$ augmented with a deterministic transitive
closure operator. Lastly, we consider ordered databases and provide a
complete analysis of a hierarchy on $\mathrm{IES}(\mathcal{SQL})$ defined with respect to
arity-bounded auxiliary relations.



1 Introduction 

In the context of querying in a database system, for varied reasons such as effi- 
ciency and reliability, the user is often restricted to a special ambient language 
of that database system. For example, in commercial relational database sys- 
tems, the user is restricted to use SQL to express queries. These special query 
languages are usually not Turing-complete. Consequently, there are queries that 
they cannot express. For example, relational algebra cannot test if a given table 
has an even number of rows and SQL cannot produce the transitive closure of a 
table containing the edge relationships of an unordered graph. The preceding
discussion on query expressibility is based on the classical “static” setting, which 
assumes that the query must compute its answer from “scratch.” That is, the 
input to a query is given all at once and the output must be produced all at 
once. 

However, a database normally builds its tables over a period of time by a 
sequence of insertions and deletions of individual records. Therefore, it is rea- 
sonable to consider query expressibility in the following non-classical “dynamic” 
or “incremental” setting. The writer of the query knows in advance, before the 
database is built, which query he has to write. In such an environment, he 
can take into consideration and has access to the history of updates to the in- 
tended input tables of the query. What he has available to him at any moment 
is considerably more than the classical query writer. For example, in addition 
to the current state of the input table, he would have access to the next incoming
update (the tuple being inserted or deleted), the current answer to the
query (assuming that it is his plan to keep a copy of the answer), and possibly 
some auxiliary information (assuming that it is his plan to keep the auxiliary 
information). Following [8, 12, etc], we call this non-classical setting of querying 
databases “incremental query evaluation.” 

There are two kinds of incremental query evaluation in general. The first kind 
is where a query is definable in the ambient language. In this case, incremental 
evaluation is possible simply by re-executing the query from scratch every time 
an answer to the query is needed. The main challenge here is how to write 
the query in a smarter way to avoid re-executing the query from scratch all 
the time [12, 13, etc.]. The second kind is where a query is not definable in the
ambient language in the classical sense. Then the question arises as to whether 
this same query can be expressed in the non-classical sense, where we allow the 
query writer access to the extra incremental information mentioned earlier. This 
second kind of incremental query evaluation is the main interest of this paper. 
The main questions addressed in this setting deal with conditions under which 
it is possible to evaluate queries incrementally. 

Let us motivate this second kind of incremental query evaluation by a very 
simple example using the relational calculus (first-order logic) as the ambient 
language. Let parity be the query that returns true iff the cardinality of a set 
X is even. This query cannot be expressed in relational calculus, but it can be 
incrementally evaluated. Indeed, on the insertion of an $x$ into $X$, one replaces
the current answer to parity by its negation if $x \notin X$, and keeps it intact if
$x \in X$. On the deletion of an $x$ from $X$, one negates the current answer if $x \in X$,
and keeps the answer unchanged if $x \notin X$. Clearly, this algorithm is first-order
definable. 
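
As an illustration only, the parity maintenance just described can be sketched in Python. The class and method names are our own assumptions; the point is that each update inspects only the incoming element, the current set and the current answer, and never recounts the set.

```python
class ParityMaintainer:
    """Keeps the answer to parity up to date under single-element updates."""

    def __init__(self):
        self.X = set()      # the input set, initially empty
        self.even = True    # current answer: the empty set has even cardinality

    def insert(self, x):
        if x not in self.X:             # a genuinely new element flips the answer
            self.even = not self.even
        self.X.add(x)

    def delete(self, x):
        if x in self.X:                 # removing a present element flips the answer
            self.even = not self.even
        self.X.discard(x)


p = ParityMaintainer()
for op, x in [("ins", 1), ("ins", 2), ("ins", 2), ("del", 5), ("del", 1)]:
    p.insert(x) if op == "ins" else p.delete(x)
print(p.even, len(p.X) % 2 == 0)
```

The membership tests `x in X` are the only conditions consulted, which mirrors why the update rules are first-order definable even though parity itself is not.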

We denote the class of queries that can be incrementally evaluated in a
language $\mathcal{L}$, using auxiliary relations of arity up to $k$, $k > 0$, by $\mathrm{IES}(\mathcal{L})_k$. We let
$\mathrm{IES}(\mathcal{L})_0$ be the class of queries incrementally evaluated in $\mathcal{L}$ without using any
auxiliary data (like the parity example above). Finally, $\mathrm{IES}(\mathcal{L})$ is the union of
all $\mathrm{IES}(\mathcal{L})_k$.

The most frequently considered class is $\mathrm{IES}(\mathcal{FO})$, which uses the relational
calculus as its ambient language. There are several examples of queries belonging
to $\mathrm{IES}(\mathcal{FO})$ that are not definable in $\mathcal{FO}$ [21,7]. The most complex example is
probably that of [9], which is a query that is in $\mathrm{IES}(\mathcal{FO})$ but cannot be expressed
even in first-order logic enhanced with counting and transitive closure operators.
It is known [7] that the arity hierarchy is strict, $\mathrm{IES}(\mathcal{FO})_k \subsetneq \mathrm{IES}(\mathcal{FO})_{k+1}$, and
that $\mathrm{IES}(\mathcal{FO}) \subseteq \mathrm{PTIME}$. Still, for most queries of interest, such as the transitive
closure of a relation, it remains open whether they belong to $\mathrm{IES}(\mathcal{FO})$. It also
appears [9] that proving lower bounds for $\mathrm{IES}(\mathcal{FO})$ is as difficult as proving some
circuit lower bounds.

Most commercial database systems speak SQL and most practical imple- 
mentations of SQL are more expressive than the relational algebra because they 
have aggregate functions (e.g., AVG, TOTAL) and grouping constructs (GROUPBY, 
HAVING). This motivated us [19] to look at incremental evaluation systems based
on the “core” of SQL, which comprises relational calculus plus grouping and 
aggregation. Somewhat surprisingly, we discovered the following. First, queries 
such as the transitive closure and even some PTIME-complete queries, can be 
incrementally evaluated by core SQL queries (although the algorithms presented 
in [19] were quite ad hoc). Second, the arity hierarchy for core SQL collapses at 
the second level. 

Our goal here is to investigate deeper into the incremental evaluation capa- 
bilities of SQL-like languages. In particular, we want to find nice descriptions 
of classes of queries that can be incrementally evaluated. The first set of re- 
sults shows that the classes are indeed much larger than we suspected before. 
We define a language $\mathcal{SQL}$ that extends relational algebra with grouping and
aggregation, and show that: 

1. Every query whose data complexity is in the polynomial hierarchy (equivalently:
every second-order definable query) is in $\mathrm{IES}(\mathcal{SQL})$.

2. There exist PSPACE-complete queries in $\mathrm{IES}(\mathcal{SQL})$.

3. Adding deterministic transitive closure to $\mathcal{SQL}$ (a DLOGSPACE operator)
results in a language that can incrementally evaluate every query of PSPACE
data complexity.

In the second part of the paper, we compare the IES hierarchy in the cases of
ordered and unordered types. We show that the $\mathrm{IES}(\mathcal{SQL})_k$ hierarchy collapses at
level 1 in the case of ordered types. We further paint the complete picture of the
relationship between the classes of the ordered and the unordered hierarchies;
see Figure 2.

As one might expect, the reason for the enormous power of SQL-like lan- 
guages in terms of incremental evaluation is that one can create and maintain 
rather large structures on numbers and use them for coding queries. In some 
cases, this can be quite inefficient. However, we have demonstrated elsewhere [6] 
that coding an algorithm for incremental evaluation of transitive closure in SQL 
is reasonably simple. Moreover, it has also been shown [22] that the performance 
is adequate for a large class of graphs. Thus, while the proofs here in general 
do not lend themselves to efficient algorithms (nor can they, as we show how 
to evaluate presumably intractable queries), the incremental techniques can well
be used in practice. However, proving that certain queries cannot be incremen- 
tally evaluated in SQL within some complexity bounds appears beyond reach, 
as doing so would separate some complexity classes, cf. [15]. 

Organization. In the next section, we give preliminary material, such as
a theoretical language $\mathcal{SQL}$ capturing the grouping and aggregation features
of SQL, the definition of incremental evaluation system IES, a nested relational
language, and the relationship between the incremental evaluation systems based
on the nested language and aggregation.

In Section 3, we prove that $\mathrm{IES}(\mathcal{SQL})$, the incremental evaluation system
based on core SQL, includes every query whose data complexity is in the polynomial
hierarchy. We also give an example of a PSPACE-complete query which
belongs to $\mathrm{IES}(\mathcal{SQL})$, and show that $\mathcal{SQL}$ augmented with the deterministic
transitive closure operator can incrementally evaluate every query of PSPACE data
complexity.

In Section 4, we consider a slightly different version of $\mathcal{SQL}$, denoted by
$\mathcal{SQL}^{<}$. In this language, base types come equipped with an order relation. We
show that the $\mathrm{IES}(\mathcal{SQL}^{<})_k$ hierarchy collapses at the first level, and explain the
relationship between the classes in both $\mathrm{IES}(\mathcal{SQL})_k$ and $\mathrm{IES}(\mathcal{SQL}^{<})_k$ hierarchies.

2 Preliminaries 

Languages $\mathcal{SQL}$ and $\mathcal{NRC}$. A functional-style language that captures the
essential features of SQL (grouping and aggregation) has been studied in a number
of papers [18,5,15]. While the syntax slightly varies, choosing any particular one
will not affect our results, as the expressive power is the same. Here we work
with the version presented in [15].

The language is defined as a suitable restriction of a nested language. The 
type system is given by 

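$$\begin{array}{rcl}
\mathit{Base} & := & b \mid \mathbb{Q} \\
rt & := & \mathit{Base} \times \ldots \times \mathit{Base} \\
t & := & \mathbb{B} \mid rt \mid \{rt\} \mid t \times \ldots \times t
\end{array}$$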

The base types are $b$ and $\mathbb{Q}$, with the domain of $b$ being an infinite set $U$,
disjoint from $\mathbb{Q}$. We use $\times$ for product types; the semantics of $t_1 \times \ldots \times t_n$ is
the cartesian product of domains of types $t_1, \ldots, t_n$. The semantics of $\{t\}$ is the
finite powerset of elements of type $t$. We use the notation $rt$ for record types,
and let $\mathbb{B}$ be the Boolean type.

A database schema $\sigma$ is a collection of relation names and their types of the
form $\{rt\}$. For a relation $R \in \sigma$, we denote its type by $tp_{\sigma}(R)$. Expressions of
the language over a fixed relational schema $\sigma$ are shown in Figure 1. We adopt
the convention of omitting the explicit type superscripts in these expressions
whenever they can be inferred from the context. We briefly explain the semantics
here. The set of free variables of an expression $e$ is defined in a standard way
by induction on the structure of $e$ and we often write $e(x_1, \ldots, x_n)$ to explicitly
indicate that $x_1, \ldots, x_n$ are free variables of $e$. Expressions $\bigcup\{e_1 \mid x \in e_2\}$ and
$\sum\{e_1 \mid x \in e_2\}$ bind the variable $x$ (furthermore, $x$ is not allowed to be free in
$e_2$ for this expression to be well-formed).

For each fixed schema $\sigma$ and an expression $e(x_1, \ldots, x_n)$, the value of
$e(x_1, \ldots, x_n)$ is defined by induction on the structure of $e$ and with respect
to a $\sigma$-database $D$ and a substitution $[x_1 := a_1, \ldots, x_n := a_n]$ that
assigns to each variable $x_i$ a value $a_i$ of the appropriate type. We write
$e[x_1 := a_1, \ldots, x_n := a_n](D)$ to denote this value; if the context is understood,
we shorten this to $e[x_1 := a_1, \ldots, x_n := a_n]$ or just $e$. We have equality test on
both base types. On the rationals, we have the order and the usual arithmetic
operations. There is the tupling operation $(e_1, \ldots, e_n)$ and projections $\pi_{i,n}$ on
tuples. The value of $\{e\}$ is the singleton set containing the value of $e$; $e_1 \cup e_2$
computes the union of two sets, and $\emptyset$ is the empty set.

To define the semantics of $\bigcup$ and $\sum$, assume that the value of $e_2$ is the set
$\{b_1, \ldots, b_m\}$. Then the value of $\bigcup\{e_1 \mid x \in e_2\}$ is defined to be

$$\bigcup_{i=1}^{m} e_1[x_1 := a_1, \ldots, x_n := a_n, x := b_i](D).$$

The value of $\sum\{e_1 \mid x \in e_2\}$ is $c_1 + \ldots + c_m$, where each $c_i$ is the value of

$$e_1[x_1 := a_1, \ldots, x_n := a_n, x := b_i], \quad i = 1, \ldots, m.$$
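
The semantics of the two binding constructs can be mirrored directly in a few lines of Python. This is a sketch under our own assumptions: expressions are modelled as Python functions from an environment (a dict of variable bindings) to a value, rationals as `Fraction`, and the names `big_union` and `big_sum` are invented for the illustration.

```python
from fractions import Fraction

def big_union(e1, e2_value, env, var):
    """Value of U{ e1 | var in e2 }: union of e1 evaluated once per element b_i."""
    result = set()
    for b in e2_value:
        result |= e1({**env, var: b})
    return result

def big_sum(e1, e2_value, env, var):
    """Value of Sum{ e1 | var in e2 }: c_1 + ... + c_m, one summand per element b_i."""
    total = Fraction(0)
    for b in e2_value:
        total += e1({**env, var: b})
    return total

# Hypothetical relation of (name, amount) records.
R = {("a", Fraction(3)), ("b", Fraction(4)), ("c", Fraction(4))}

count = big_sum(lambda env: Fraction(1), R, {}, "x")         # Sum{ 1 | x in R }
total = big_sum(lambda env: env["x"][1], R, {}, "x")         # Sum{ pi_2(x) | x in R }
names = big_union(lambda env: {env["x"][0]}, R, {}, "x")     # U{ {pi_1(x)} | x in R }
print(count, total, names)
```

Because the sum ranges over the elements of the set rather than over the distinct values of the summand, the two records with amount 4 both contribute, which is what allows aggregates such as COUNT and TOTAL to be coded this way.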




Fig. 1. Expressions of $\mathcal{SQL}$ over schema $\sigma$



Properties of $\mathcal{SQL}$. The relational part of the language (without arithmetic and
aggregation) is known [18,3] to have essentially the power of the relational algebra.
When the standard arithmetic and the $\sum$ aggregate are added, the language
becomes [18] powerful enough to code standard SQL aggregation features such as
the GROUPBY and HAVING clauses, and aggregate functions such as TOTAL, COUNT,
AVG, MIN, MAX, which are present in all commercial versions of SQL [1].

Another language that we frequently use is the nested relational calculus
$\mathcal{NRC}$. Its type system is given by

$$t := b \mid \mathbb{B} \mid t \times \ldots \times t \mid \{t\}$$

That is, sets nested arbitrarily deep are allowed. The expressions of $\mathcal{NRC}$ are
exactly the expressions of $\mathcal{SQL}$ that do not involve arithmetic, except that there
is no restriction to flat types in the set operations.
Incremental evaluation systems. The idea of an incremental evaluation system,
or IES, is as follows. Suppose we have a query $Q$ and a language $\mathcal{L}$. An $\mathrm{IES}(\mathcal{L})$
for incrementally evaluating $Q$ is a system consisting of an input database, an
answer database, an optional auxiliary database, and a finite set of “update”
functions that correspond to different kinds of permissible updates to the input
database. These update functions take as input the corresponding update, the
input database, the answer database, and the auxiliary database; and collectively
produce as output the updated input database, the updated answer database,
and the updated auxiliary database. There are two main requirements: the
condition $O = Q(I)$ must be maintained, where $I$ is the input database and $O$
is the output database; and the update functions must be expressible in
the language $\mathcal{L}$. For example, in the previous section we gave an incremental
evaluation system for the parity query in relational calculus. That system did
not use any auxiliary relations.

Following [21,7,8,19], we consider here only queries that operate on rela- 
tional databases storing elements of the base type b. These queries are those 
whose inputs are of types of the form $\{b \times \ldots \times b\}$. Queries whose incremental
evaluation we study have to be generic, that is, invariant under permutations of 
the domain U of type b. Examples include all queries definable in a variety of 
classical query languages, such as relational calculus, datalog, and the while-loop 
language. The criteria for permissible update are restricted to the insertion and 
deletion of a single tuple into an input relation. 

While the informal definition given above is sufficient for understanding the
results of the paper, we give a formal definition of $\mathrm{IES}(\mathcal{L})$, as in [19], which is
very similar to the definitions of FOIES [7] and Dyn-C [21]. Suppose the types
of relations of the input database are $\{rt_1\}, \ldots, \{rt_m\}$, where $rt_1, \ldots, rt_m$ are
record types of the form $b \times \ldots \times b$. We consider elementary updates of the
form $ins_i(x)$ and $del_i(x)$, where $x$ is of type $rt_i$. Given an object $X$ of type
$S = \{rt_1\} \times \ldots \times \{rt_m\}$, applying such an update results in inserting $x$ into or
deleting $x$ from the $i$th set in $X$, that is, the set of type $\{rt_i\}$. Given a sequence
$U$ of updates, $U(X)$ denotes the result of applying the sequence $U$ to an object
$X$ of type $S$.

Given a query $Q$ of type $S \to T$ (that is, an expression of type $T$ with free
variables of types $\{rt_1\}, \ldots, \{rt_m\}$), and a type $T_{aux}$ (of auxiliary data), consider
a collection of functions $F_Q$:

$$f_{init} : S \to T, \qquad f^{aux}_{init} : S \to T_{aux}$$

$$f^{i}_{del} : rt_i \times S \times T \times T_{aux} \to T, \qquad f^{aux,i}_{del} : rt_i \times S \times T \times T_{aux} \to T_{aux}$$

$$f^{i}_{ins} : rt_i \times S \times T \times T_{aux} \to T, \qquad f^{aux,i}_{ins} : rt_i \times S \times T \times T_{aux} \to T_{aux}$$

Given an elementary update $u$, we associate two functions with it. The function
$f_u : S \times T \times T_{aux} \to T$ is defined as $\lambda(X, Y, Z).\, f^{i}_{del}(a, X, Y, Z)$ if $u$ is $del_i(a)$, and as
$\lambda(X, Y, Z).\, f^{i}_{ins}(a, X, Y, Z)$ if $u$ is $ins_i(a)$. We similarly define $f^{aux}_u : S \times T \times T_{aux} \to T_{aux}$.

Given a sequence of updates $U = \{u_1, \ldots, u_l\}$, define inductively the collection
of objects: $X_0 = \emptyset : S$, $RES_0 = f_{init}(X_0)$, $AUX_0 = f^{aux}_{init}(X_0)$ (where $\emptyset$ of
type $S$ is a product of $m$ empty sets), and

$$X_{i+1} = u_{i+1}(X_i)$$

$$RES_{i+1} = f_{u_{i+1}}(X_i, RES_i, AUX_i)$$

$$AUX_{i+1} = f^{aux}_{u_{i+1}}(X_i, RES_i, AUX_i)$$

Finally, we define $F_Q(U)$ as $RES_l$.

We now say that there exists an incremental evaluation system for $Q$ in $\mathcal{L}$
if there is a type $T_{aux}$ and a collection $F_Q$ of functions, typed as above, such
that, for any sequence $U$ of updates, $F_Q(U) = Q(U(\emptyset))$. We also say then that
$Q$ is expressible in $\mathrm{IES}(\mathcal{L})$ or maintainable in $\mathcal{L}$. If $T_{aux}$ is a product of flat types
$\{rt\}$, with $rt$s having at most $k$ components, then we say that $Q$ is in $\mathrm{IES}(\mathcal{L})_k$.

Since every expression in $\mathcal{NRC}$ or $\mathcal{SQL}$ has a well-typed function associated
with it, the definition above applies to these languages.
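As an illustration of the definition, the following is a minimal Python sketch (not part of the original formalism) that runs an update sequence through a collection of maintenance functions and checks $F_Q(U) = Q(U(\emptyset))$ for the parity query. The names `IES`, `run`, `parity` and `parity_ies`, and the encoding of updates as `("ins", x)` / `("del", x)` pairs, are assumptions made for the example; the schema is restricted to a single unary input relation, and no auxiliary data is used.

```python
from dataclasses import dataclass
from typing import Any, Callable, FrozenSet, Iterable, Tuple

Update = Tuple[str, Any]  # ("ins", x) or ("del", x); hypothetical encoding


@dataclass
class IES:
    """A collection F_Q of maintenance functions for one unary input relation."""
    f_init: Callable[[FrozenSet], Any]
    f_init_aux: Callable[[FrozenSet], Any]
    f_ins: Callable[[Any, FrozenSet, Any, Any], Any]
    f_ins_aux: Callable[[Any, FrozenSet, Any, Any], Any]
    f_del: Callable[[Any, FrozenSet, Any, Any], Any]
    f_del_aux: Callable[[Any, FrozenSet, Any, Any], Any]

    def run(self, updates: Iterable[Update]):
        """Return (X_l, RES_l, AUX_l) after applying the update sequence."""
        x = frozenset()
        res, aux = self.f_init(x), self.f_init_aux(x)
        for kind, a in updates:
            # Both functions see the state *before* the update, as in the definition.
            if kind == "ins":
                new_res = self.f_ins(a, x, res, aux)
                new_aux = self.f_ins_aux(a, x, res, aux)
                x = x | {a}
            else:
                new_res = self.f_del(a, x, res, aux)
                new_aux = self.f_del_aux(a, x, res, aux)
                x = x - {a}
            res, aux = new_res, new_aux
        return x, res, aux


def parity(x):
    """The query Q: is the cardinality of the input relation odd?"""
    return len(x) % 2 == 1


# Parity needs no auxiliary data: the answer flips exactly when the
# update really changes the input relation.
parity_ies = IES(
    f_init=lambda x: False,
    f_init_aux=lambda x: None,
    f_ins=lambda a, x, res, aux: res if a in x else not res,
    f_ins_aux=lambda a, x, res, aux: None,
    f_del=lambda a, x, res, aux: (not res) if a in x else res,
    f_del_aux=lambda a, x, res, aux: None,
)

updates = [("ins", 1), ("ins", 2), ("ins", 2), ("del", 1), ("del", 7), ("ins", 3)]
final_x, final_res, _ = parity_ies.run(updates)
assert final_res == parity(final_x)
```

Each maintenance function only inspects the update, the old input, the old answer and the old auxiliary data, which is the shape required by the typing of $f_u$ and $f^{aux}_u$ above.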

Properties of IES. Clearly, every query expressible in $\mathcal{L}$ belongs to $\mathrm{IES}(\mathcal{L})_\emptyset$. What
makes IES interesting is that many queries that are not expressible in $\mathcal{L}$ can still
be incrementally evaluated in $\mathcal{L}$. For example, the transitive closure of undirected
graphs belongs to $\mathrm{IES}(\mathcal{FO})_2$ [21,7]. One of the more remarkable facts about
$\mathrm{IES}(\mathcal{FO})$, mentioned already in the introduction, is that the arity hierarchy is
strict: $\mathrm{IES}(\mathcal{FO})_k \subset \mathrm{IES}(\mathcal{FO})_{k+1}$ [7]. Also, every query in $\mathrm{IES}(\mathcal{FO})$ has PTIME
data complexity.

A number of results about $\mathrm{IES}(\mathcal{SQL})$ exist in the literature. We know [4] that
$\mathcal{SQL}$ is unable to maintain transitive closure of arbitrary graphs without using
auxiliary relations. We also know that transitive closure of arbitrary graphs
remains unmaintainable in $\mathcal{SQL}$ even in the presence of auxiliary data whose
degrees are bounded by a constant [5]. On the positive side, we know that if
the bounded degree constraint on auxiliary data is removed, transitive closure of
arbitrary graphs becomes maintainable in $\mathcal{SQL}$. In fact, this query and even the
alternating path query belong to $\mathrm{IES}(\mathcal{SQL})_2$. Finally, we also know [19] that the
$\mathrm{IES}(\mathcal{SQL})_k$ hierarchy collapses to $\mathrm{IES}(\mathcal{SQL})_2$. We shall use the following result
[19] several times in this paper.

Fact 1. $\mathrm{IES}(\mathcal{NRC}) \subseteq \mathrm{IES}(\mathcal{SQL})$. □

3 Maintainability of Second Order Queries 

We prove in this section that we can incrementally evaluate all queries whose 
data complexity is in the polynomial hierarchy PHIER (equivalently, all queries 
expressible in second order logic). The proof, sketched at the end of the section, 
is based on the ability to maintain very large sets using arithmetic, which suffices 
to model second-order expressible queries. 

Theorem 1. $\mathcal{SQL}$ can incrementally evaluate all queries whose data complexity
is in the polynomial hierarchy. That is, PHIER $\subseteq \mathrm{IES}(\mathcal{SQL})$. □

The best previously known [19] positive result on the limit of incremental
evaluation in $\mathcal{SQL}$ was for a PTIME-complete query. Theorem 1 shows that the class
of queries that can be incrementally evaluated in $\mathcal{SQL}$ is presumably much larger
than the class of tractable queries. In particular, every NP-complete problem is
in $\mathrm{IES}(\mathcal{SQL})$.

The next question is whether the containment can be replaced by equality.
This appears unlikely in view of the following.

Proposition 1. There exists a problem complete for PSPACE which belongs to
$\mathrm{IES}(\mathcal{SQL})$. □

Note that this is not sufficient to conclude the containment of PSPACE in
$\mathrm{IES}(\mathcal{SQL})$, as the notion of reduction for dynamic complexity classes is more
restrictive than the usual reduction notions in complexity theory, see [21]. In fact,
we do not know if PSPACE is contained in $\mathrm{IES}(\mathcal{SQL})$. We can show, however, that
a mild extension of $\mathcal{SQL}$ gives us a language powerful enough to incrementally
evaluate all PSPACE queries. Namely, consider the following addition to the
language:

$$\frac{e : \{rt \times rt\}}{dtc(e) : \{rt \times rt\}}$$

Here $dtc$ is the deterministic transitive closure operator [16]. Given a graph with
the set of edges $E$, there is an edge $(a, b)$ in its deterministic transitive closure
iff there is a deterministic path $(a, a_1), (a_1, a_2), \ldots, (a_{n-1}, a_n), (a_n, b)$ in $E$; that
is, a path in which every node $a_i$, $i \le n$, and $a$ have outdegree 1. It is known
[16] that $dtc$ is complete for DLOGSPACE. We prove the following new result.

Proposition 2. $\mathcal{SQL} + dtc$ can incrementally evaluate all queries of
PSPACE data complexity. That is, PSPACE $\subseteq \mathrm{IES}(\mathcal{SQL} + dtc)$. □
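To make the deterministic transitive closure operator concrete, here is a small Python sketch. It is only an illustration of the definition given above, not the operator of the formal language; the function name `dtc` and the representation of a graph as a set of pairs are assumptions of the example.

```python
def dtc(edges):
    """Deterministic transitive closure of a set of directed edges (pairs).

    (a, b) is in the result iff b is reached from a along a path on which
    a and every intermediate node have exactly one outgoing edge.
    """
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    closure = set()
    for start in succ:
        node, seen = start, set()
        # Keep walking only while the current node has outdegree 1;
        # `seen` stops the walk once a cycle has been traversed.
        while node in succ and len(succ[node]) == 1 and node not in seen:
            seen.add(node)
            (node,) = succ[node]
            closure.add((start, node))
    return closure


edges = {(1, 2), (2, 3), (3, 4), (3, 5)}
# Node 3 has two successors, so no deterministic path continues through it.
assert dtc(edges) == {(1, 2), (1, 3), (2, 3)}
```

When every node has outdegree at most 1, as for the relation $R_0$ used in the proof of Proposition 2, the result coincides with the ordinary transitive closure.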

We now sketch the proofs of these results. We use the notation $\wp(B^k)$ to
mean the powerset of the $k$-fold cartesian product of the set $B : \{b\}$ of atomic
objects. The proof of Theorem 1 involves two steps. In the first step, we show
that $\wp(B^k)$ can be maintained in $\mathcal{NRC}$ for every $k$, when $B$ is updated. In the
second step, we show that if the domain of each second order quantifier is made
available to $\mathcal{NRC}$, then any second order logic formula can be translated to
$\mathcal{NRC}$. The first of these two steps is also needed for the proof of Propositions 1
and 2, so we abstract it out in the following lemma.

Lemma 1. $\mathcal{NRC}$ can incrementally evaluate $\wp(B^k)$ for every $k$ when $B : \{b\}$
is updated.

Proof sketch. Let $PB_k^o$ and $PB_k^n$ be the symbols naming the nested relation
$\wp(B^k)$ immediately before and after the update. We proceed by induction on $k$.
The simple base case of $k = 1$ (maintaining the powerset of a unary relation) is
omitted. For the induction case of $k > 1$, we consider two cases.

Suppose the update is the insertion of a new element $x$ into the
set $B$. By the induction hypothesis, $\mathcal{NRC}$ can maintain $\wp(B^i)$ for $i < k$. So we
can create the following nested sets: $Y_0 = \{\{(x, \ldots, x)\}\}$ and
$Y_i = \{\{(z_1, \ldots, z_k) \mid \ldots \in X\} \mid X \in \ldots\}$ for $i = 1, \ldots, k - 1$.
Let $cartprod$ be the function that forms the cartesian product of two
sets; this function is easily definable in $\mathcal{NRC}$. Let $allunion$ be the function that
takes a tuple $(S_1, \ldots, S_t)$ of sets and returns a set of sets containing all possible
unions of $S_1, \ldots, S_t$; this function is also definable in $\mathcal{NRC}$ because the number
of combinations is fixed once $k$ is given. Then it is not difficult to see that
$PB_k^n = \{X \mid Y \in (PB_k^o\ cartprod\ Y_0\ cartprod\ Y_1\ cartprod \cdots cartprod\ Y_{k-1}),\ X \in allunion(Y)\}$.

Suppose the update is the deletion of an existing element $x$ from the set $B$.
Then all we need is to delete from each of $PB_1, \ldots, PB_k$ all the sets that have
$x$ as a component of one of their elements, which is definable in $\mathcal{NRC}$. □
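The two cases of the proof can be mimicked directly on sets. The Python sketch below maintains $\wp(B^k)$ under single-element insertions and deletions. It is a simplified set-based variant written for illustration, not the $\mathcal{NRC}$ construction with $cartprod$ and $allunion$; the class name `PowersetOfPower` and the helper `subsets` are assumptions of the example.

```python
from itertools import chain, combinations, product


def subsets(items):
    """Yield all subsets of `items` as frozensets."""
    items = list(items)
    return (frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))


class PowersetOfPower:
    """Maintains PB = powerset(B^k) while B changes one element at a time."""

    def __init__(self, k):
        self.k = k
        self.B = set()
        self.PB = {frozenset()}  # B is empty, so B^k is empty as well

    def insert(self, x):
        if x in self.B:
            return
        new_B = self.B | {x}
        # The tuples of the new B^k that were not in the old B^k are
        # exactly those that mention x in some component.
        new_tuples = [t for t in product(new_B, repeat=self.k) if x in t]
        # Every new subset is an old subset extended by some set of new tuples.
        self.PB = {s | t for s in self.PB for t in subsets(new_tuples)}
        self.B = new_B

    def delete(self, x):
        if x not in self.B:
            return
        # Drop every set having x as a component of one of its elements.
        self.PB = {s for s in self.PB if all(x not in t for t in s)}
        self.B.discard(x)


p = PowersetOfPower(2)
p.insert("a")
p.insert("b")
assert len(p.PB) == 2 ** (2 ** 2)  # |B^2| = 4, hence 16 subsets
```

The sizes grow exponentially, so the sketch is only usable for very small $B$ and $k$; it is meant to show that each update needs nothing beyond the old powerset and the updated element.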



Proof sketch of Theorem 1. Let $Q : \{rt\}$ be a query in PHIER, with input
relations $R_1, \ldots, R_m$ of types $\{rt_i\}$. Then $Q$ is definable by a second-order formula
with $n$ free first-order variables, where $n$ is the arity of $rt$. Suppose this formula
is $\varphi(x) = Q_1 S_1 \ldots Q_p S_p\, \alpha(x, S_1, \ldots, S_p)$, where $\alpha$ is a first-order formula in the
language of the $R_i$s, the $S_i$s, and equality; the $Q_i$s are the quantifiers $\forall$ and $\exists$; and each $S_i$
has arity $k_i$. Then, to maintain $Q$ in $\mathcal{NRC}$, we have to maintain: (a) the active
domain $B$ of the database $R_1, \ldots, R_m$, and (b) all $\wp(B^{k_i})$. Note that the
definition of $\mathrm{IES}(\mathcal{NRC})$ puts no restriction on types of auxiliary relations. Since a
single insertion into or deletion from a relation $R_i$ results in a fixed number of
insertions and deletions in $B$ that is bounded by the maximal arity of a
relation, we conclude from Lemma 1 that all $\wp(B^{k_i})$ can be incrementally evaluated.
Since $\mathcal{NRC}$ has all the power of first-order logic [3], we conclude that it can
incrementally evaluate $Q$ by maintaining all the powersets and then evaluating a
first-order query on them. □

Proof sketch of Proposition 1. It is not hard to show that with $\wp(B^k)$, one can
incrementally evaluate the reachable deadlock problem, which is known to
be PSPACE-complete [20].

Proof sketch of Proposition 2. Let $Q$ be a PSPACE query. It is known then that
$Q$ is expressible in partial-fixpoint logic, if the underlying structure is ordered.
We know [19] that an order relation on the active domain can be maintained
in $\mathcal{SQL}$. We also know [2] that $Q$ is of the form $\mathrm{PFP}_{y,S}\, \varphi(x, y, S)$, where $\varphi$ is a
first-order formula. To show that $Q$ is in $\mathrm{IES}(\mathcal{SQL} + dtc)$ we do the following. We
maintain the active domain $B$, an order relation on it, and $\wp(B^k)$ where $k = |y|$.
We maintain it, however, as a flat relation of type $\{\mathbb{Q} \times b \times \cdots \times b\}$ where subsets
are coded; that is, a tuple $(c, a)$ indicates that $a$ belongs to a subset of $B^k$ coded
by $c$. That this can be done follows from the proof of $\mathrm{IES}(\mathcal{NRC}) \subseteq \mathrm{IES}(\mathcal{SQL})$
in [19]. We next define a binary relation $R_0$ of type $\{\mathbb{Q} \times \mathbb{Q}\}$ such that a pair
$(c_1, c_2)$ is in it if applying the operator defined by $\varphi$ to the subset of $B^k$ coded
by $c_1$ yields $c_2$. It is routine to verify that this is definable. Next, we note that
the outdegree of every node of $R_0$ is at most 1; hence, $dtc(R_0)$ is its transitive
closure. Using this, we can determine the value of the partial fixpoint operator.

□

Limitations of Incremental Evaluation in $\mathcal{SQL}$. Having captured the whole
of the polynomial hierarchy inside $\mathrm{IES}(\mathcal{SQL})$, can we do more? Proving lower
bounds in the area of dynamic complexity is very hard [21,9] and $\mathcal{SQL}$ is
apparently no exception. Still, we can establish some easy limitations. More
precisely, we address the following question. We saw that the powerset of $B^k$
can be incrementally evaluated in $\mathcal{NRC}$. Does this continue to hold for
iterated powerset constructions? For example, can we maintain sets like $\wp(\wp(B^k))$,
$\wp(\wp(B)\ cartprod\ \wp(B))$, etc.? If we could maintain $\wp(\wp(B^k))$ in $\mathcal{NRC}$, it would
have shown that PSPACE is contained in $\mathrm{IES}(\mathcal{SQL})$. However, it turns out that
Lemma 1 is close to the limit. First, we note the 2-DEXPSPACE data complexity
of $\mathrm{IES}(\mathcal{SQL})$.

Proposition 3. For every query in $\mathrm{IES}(\mathcal{SQL})$ (even without restriction to flat
types) there exist numbers $c, d > 0$ such that the total size of the input database,
answer database, and auxiliary database after $n$ updates is at most $c^{d^n}$.

Proof. It is known that $\mathcal{SQL}$ queries have PTIME data complexity [18]. Thus, if
$f(n)$ is the size of the input, output and auxiliary databases after $n$ updates, we
obtain $f(n + 1) \le C f(n)^m$ for appropriately chosen $C, m > 0$. The claim now
follows by induction on $n$. □
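The induction can be spelled out as follows. This is a sketch under two additional assumptions that are not stated in the text: the bound is taken to have the doubly exponential form $c^{d^n}$, and the constants of the recurrence are normalized so that $C \ge 1$ and $m \ge 2$ (enlarging either constant only weakens the recurrence). The auxiliary function $g$ is introduced for the derivation only.

```latex
\begin{align*}
g(n) &:= C^{1/(m-1)}\, f(n), \\
g(n+1) &= C^{1/(m-1)}\, f(n+1)
        \;\le\; C^{1/(m-1)} \cdot C\, f(n)^m
        \;=\; \bigl(C^{1/(m-1)} f(n)\bigr)^m
        \;=\; g(n)^m, \\
g(n) &\le g(0)^{m^n} \qquad \text{(by induction on } n\text{)}, \\
f(n) &\le g(n) \;\le\; \bigl(C^{1/(m-1)} f(0)\bigr)^{m^n} \;\le\; c^{d^n}
  \qquad \text{with } c = \max\bigl(1,\, C^{1/(m-1)} f(0)\bigr),\; d = m .
\end{align*}
```

The second line uses $\tfrac{1}{m-1} + 1 = \tfrac{m}{m-1}$, and the step $f(n) \le g(n)$ uses $C \ge 1$.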

We use $\wp^j(B^k)$ to mean taking the powerset $j$ times on the $k$-fold cartesian
product of the set $B$ of atomic objects. We know that $\wp(B^k)$ can be maintained
by $\mathcal{NRC}$. For the iterated case, not much can be done.

Corollary 1. Let $j > 1$. $\wp^j(B^k)$ can be maintained by $\mathcal{NRC}$ when $B$ is updated
iff $j = 2$ and $k = 1$.

Proof sketch. First, we show that $\wp^2(B)$ can be maintained. Let $B : \{b\}$ denote
the input database. Let $PPB = \wp(\wp(B)) : \{\{\{b\}\}\}$ denote the answer database.
$B$ is initially empty. $PPB$ is initially $\{\{\}, \{\{\}\}\}$. Suppose the update is the
insertion of a new atomic object $x$ into $B$. Let $\Delta = \{U \cup \{\{x\} \cup v \mid v \in V\} \mid U \in
PPB^o, V \in PPB^o\}$. Then $PPB^n = PPB^o \cup \Delta$ is the desired double powerset.
Suppose the update is the deletion of an old object $x$ from $B$. Then we simply
delete from $PPB$ all those sets that mention $x$. Both operations are definable in
$\mathcal{NRC}$.

That $\wp^j(B^k)$ cannot be maintained for $(j, k) \neq (2, 1)$ easily follows from the
bounds above, as the size of such a set is not majorized by $c^{d^n}$ for any constants $c, d$. □
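The maintenance of the double powerset described in the proof sketch can be tried out on small inputs. The following Python sketch follows the insertion and deletion steps literally; the function names `insert` and `delete` and the constant `EMPTY_B` are assumptions of the example, and sets of sets are modelled with `frozenset`.

```python
def insert(ppb, x):
    """Return P(P(B + {x})) from ppb = P(P(B)), assuming x is not yet in B."""
    # Delta = { U u { {x} u v | v in V } : U in PPB, V in PPB }
    delta = {
        u | frozenset(v | {x} for v in big_v)
        for u in ppb
        for big_v in ppb
    }
    return ppb | delta


def delete(ppb, x):
    """Return P(P(B - {x})): drop every family that mentions x."""
    return {family for family in ppb if all(x not in s for s in family)}


# P(P({})) = { {}, { {} } }
EMPTY_B = {frozenset(), frozenset({frozenset()})}

ppb = insert(insert(EMPTY_B, "a"), "b")
assert len(ppb) == 2 ** (2 ** 2)
assert delete(ppb, "a") == insert(EMPTY_B, "b")
```

After $n$ insertions the result has $2^{2^n}$ elements, which is consistent with the doubly exponential size bound discussed above.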

4 Low Levels of the IES hierarchy

We know that the class of queries that can be evaluated incrementally in $\mathcal{SQL}$ is
very large. We also know from earlier work [4, 19] that with restrictions on the
class of auxiliary relations, even many PTIME queries cannot be maintained.
Thus, we would like to investigate the low levels of the $\mathrm{IES}(\mathcal{SQL})$ hierarchy. This
was partly done in [19], under a severe restriction that only elements of base
types be used in auxiliary relations. Now, using recent results on the expressive
power of SQL-like languages and locality tools from finite-model theory [14,
15], we paint the complete picture of the relationship between the levels of the
hierarchy.

In many incremental algorithms, the presence of an order is essential. While 
having an order on the base type b makes no difference if binary auxiliary rela- 
tions are allowed (since one can maintain an order as an auxiliary relation) , there 
is a difference for the case when restrictions on the arity of auxiliary relations 
are imposed. We thus consider an extension of $\mathcal{SQL}$ denoted by $\mathcal{SQL}^<$ which is
obtained by adding a new rule

$$\frac{e_1, e_2 : b}{<_b(e_1, e_2) : \mathbb{B}}$$

where $<_b$ is interpreted as an order on the domain of the base type $b$. The main
result now relates the levels of the $\mathrm{IES}(\mathcal{SQL})_k$ and $\mathrm{IES}(\mathcal{SQL}^<)_k$ hierarchies.

Theorem 2. The relationships shown in the diagram in Figure 2 hold. 

Here A ⊂► B means that A is a proper subset of B, and
A ► B means that A ⊄ B and B ⊄ A.



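[Figure 2 (diagram not reproduced): the upper row contains $\mathcal{SQL}^<$, $\mathrm{IES}(\mathcal{SQL}^<)_\emptyset$ and $\mathrm{IES}(\mathcal{SQL}^<)_1 = \mathrm{IES}(\mathcal{SQL}^<)_{k>1}$; the lower row contains $\mathcal{SQL}$, $\mathrm{IES}(\mathcal{SQL})_\emptyset$, $\mathrm{IES}(\mathcal{SQL})_1$, $\mathrm{IES}(\mathcal{SQL})_2$, $\mathrm{IES}(\mathcal{SQL})_{k>2}$ and PHIER; the relationships between them are numbered 1 to 14.]

Fig. 2. $\mathrm{IES}(\mathcal{SQL})_k$ and $\mathrm{IES}(\mathcal{SQL}^<)_k$ hierarchies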



Proof sketch. The containment 13 was shown in this paper (Theorem 1). The
hierarchy collapse 8, as well as the inclusion 6 and the maintenance of order 14
are from [19]. We also note that in $\mathcal{SQL}$, one can incrementally evaluate a query
$q_0$ such that $q_0(D) = 2^n$, where $n$ is the size of the active domain of $D$. However,
it is known that the maximal number $\mathcal{SQL}$ or $\mathcal{SQL}^<$ can produce is at most
polynomial in the size of the active domain and the maximal number stored in
the database. This shows inclusions 2, 5 and half of 9: $\mathrm{IES}(\mathcal{SQL})_\emptyset \not\subset \mathcal{SQL}^<$.
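The contrast between the two observations can be seen in a few lines of Python. In this sketch (an illustration only, with hypothetical function names, and with the database simplified to a single unary relation so that the active domain is the relation itself) each maintenance step merely doubles or halves the stored answer, which is polynomial in the largest stored number, yet the maintained value is exponential in the size of the active domain.

```python
def q0_init():
    return 1  # 2**0 for the empty database


def q0_insert(x, domain, res):
    # Only a genuinely new element enlarges the active domain.
    return res if x in domain else 2 * res


def q0_delete(x, domain, res):
    return res // 2 if x in domain else res


def replay(updates):
    """Apply ("ins", x) / ("del", x) updates and return (domain, answer)."""
    domain, res = set(), q0_init()
    for kind, x in updates:
        if kind == "ins":
            res = q0_insert(x, domain, res)
            domain.add(x)
        else:
            res = q0_delete(x, domain, res)
            domain.discard(x)
    return domain, res


domain, res = replay([("ins", "a"), ("ins", "b"), ("ins", "a"),
                      ("ins", "c"), ("del", "b")])
assert res == 2 ** len(domain)
```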

Next, consider an input of type $\{b\}$, and a query

$$q_1(X) = \begin{cases} 2^{|X|} & \text{if } |X| \text{ is a power of 2} \\ 0 & \text{otherwise} \end{cases}$$

This query belongs to $\mathrm{IES}(\mathcal{SQL})_1$, as we can maintain the set $\{0, 1, 2, \ldots, 2^{|X|}\}$
and then use standard techniques to test for the powers of 2. However, $q_1 \notin
\mathrm{IES}(\mathcal{SQL}^<)_\emptyset$. Indeed, if $|X| = 2^m - 1$, then $q_1(X) = 0$ and thus on an insert into
$X$, the maintenance query would have to produce an integer exponential in the
size of the input. This shows 3, 6, and half of 11: $\mathrm{IES}(\mathcal{SQL})_1 \not\subset \mathrm{IES}(\mathcal{SQL}^<)_\emptyset$.

The proof of collapse 4 proceeds similarly to the proof of 8 in [19]. To reduce 
arity 2 to arity 1, we maintain a large enough initial segment of natural numbers 
(but still polynomial) which we use to code tuples by numbers, where an element 
of base type b is coded by its relative position in the ordering of the active domain, 
and tuples are coded using the standard pairing function. Then 4 and 7 imply 12.

For the remaining relationship, we use locality techniques from finite-model
theory [10,11,14]. We shall now consider queries on tuples of flat relations
of types $\{b \times \cdots \times b\}$ into a relation of type of the same form. Given an input
database $D$, which is a tuple of relations $R_1, \ldots, R_m$, we define the Gaifman graph
$\mathcal{G}(D)$ on its active domain as an undirected graph with $(a, b)$ being an edge in it
if one of the $R_i$s has a tuple that contains both $a$ and $b$. By a distance in $D$, we mean
the distance in its Gaifman graph. Given a tuple $t$, by $S_r^D(t)$ we mean the set of
all elements of the active domain of $D$ at a distance at most $r$ from some element
of $t$. These are neighborhoods of tuples, which can be considered as databases of
the same schema as $D$, by restricting the relations of $D$ onto them. Two tuples
are said to have the same $r$-type if their $r$-neighborhoods are isomorphic. That
is, there is a bijection $f : S_r^D(t_1) \to S_r^D(t_2)$ such that $f(t_1) = t_2$ and for every
tuple $u$ of elements of $S_r^D(t_1)$, $u \in R_i$ implies $f(u) \in R_i$, and for every $v$ in
$S_r^D(t_2)$, $v \in R_i$ implies $f^{-1}(v) \in R_i$.
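The Gaifman graph and the neighborhoods of tuples are easy to compute explicitly, which may help when checking small examples of locality by hand. The Python sketch below is an illustration under assumed names (`gaifman_graph`, `neighborhood`); a database is represented as a list of relations, each a set of tuples.

```python
from collections import deque
from itertools import combinations


def gaifman_graph(relations):
    """Adjacency map of the Gaifman graph on the active domain."""
    adj = {}
    for rel in relations:
        for tup in rel:
            for a in tup:
                adj.setdefault(a, set())
            # Two distinct elements are adjacent if they share a tuple.
            for a, b in combinations(set(tup), 2):
                adj[a].add(b)
                adj[b].add(a)
    return adj


def neighborhood(adj, t, r):
    """Elements at distance at most r from some component of the tuple t."""
    dist = {a: 0 for a in t if a in adj}
    queue = deque(dist)
    while queue:  # breadth-first search started from all components of t
        a = queue.popleft()
        if dist[a] == r:
            continue
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return set(dist)


# A successor relation on 1..5: its Gaifman graph is a path.
succ = {(1, 2), (2, 3), (3, 4), (4, 5)}
adj = gaifman_graph([succ])
assert neighborhood(adj, (3,), 1) == {2, 3, 4}
assert neighborhood(adj, (1, 5), 1) == {1, 2, 4, 5}
```

Deciding whether two tuples have the same $r$-type additionally requires an isomorphism test between the restricted databases, which the sketch does not attempt.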

We now say (see [14], where connection with Gaifman's theorem [11] is
explained) that a query $Q$ is local if there exists an integer $r$ such that, if $t_1$ and
$t_2$ have the same $r$-type in $D$, then $t_1 \in Q(D)$ iff $t_2 \in Q(D)$. We shall use the
fact [15] that every query of pure relational type (no rationals) in $\mathcal{SQL}$ is local.

Now 1 follows from locality of $\mathcal{SQL}$, and the fact that $\mathcal{SQL}^<$ expresses all
queries definable in first-order logic with counting over ordered structures (see
[15]), which is known to violate locality [14]. For other relationships, consider
the following query. Its input type is $\{b \times b\} \times \{b\}$; its output is of type $\{b\}$.
We shall refer to the graph part of the input as $G$ and to the set part as $P$;
that is, the input is a pair $(G, P)$. A pair is good if $G$ is the graph of a successor
relation, and $P$ is its initial segment. A query $q$ is good if it has the following
properties whenever its input is good: (1) If $n = 2^{|P|}$, where $n$ is the number of
nodes in $G$, then $q(G, P)$ is the transitive closure of the initial segment defined
by $P$; (2) If $n \neq 2^{|P|}$, then $q(G, P) = \emptyset$. It can be shown that there is a good
query $q$ in $\mathcal{SQL}^<$; this is because with counting power we can encode fragments
of monadic second-order on small portions of the input [14].

As the next step, we show that no such good $q$ can belong to $\mathrm{IES}(\mathcal{SQL})_1$. This
shows the second half of 11 (that $\mathrm{IES}(\mathcal{SQL}^<)_\emptyset \not\subset \mathrm{IES}(\mathcal{SQL})_1$), 10, 12, and second
half of 9. It also shows 7, because we know $\mathcal{SQL}^< \subseteq \mathrm{IES}(\mathcal{SQL})_2$. To prove this,
we first reduce the problem to inexpressibility of a good query in $\mathcal{SQL}$ in the
presence of additional unary relations. This is because we can consider an input
in which $2^{|P|+1} = n$. For such an input, the answer to $q$ is $\emptyset$, but on an insert
into $P$ it becomes the transitive closure of the segment defined by $P$. As the
next step, we show that locality of $\mathcal{SQL}$ withstands adding numerical relations,
those of type $\{\mathbb{Q} \times \cdots \times \mathbb{Q}\}$, as long as there is no ordering on $b$. To prove this,
we first code $\mathcal{SQL}$ into an infinitary logic with counting, as was done in [15], and
then modify the induction argument from [17] to prove locality in the presence
of extra numerical relations. Finally, a finite number, say $m$, of unary relations of
type $\{b\}$, amounts to coloring nodes of a graph with $2^m$ colors. If we assume that
$q$ is definable with auxiliary unary relations, we fix a number $r$ witnessing its
locality, and choose $n$ big enough so that there would be two identically colored
disjoint neighborhoods of points $a$ and $b$ in $P$. This would mean that the $r$-types
of $(a, b)$ and $(b, a)$ are the same, but these tuples can clearly be distinguished by
$q$. This completes the proof. □

5 Open Problems 

We have shown that PHIER $\subseteq \mathrm{IES}(\mathcal{SQL})$, but it remains open whether a larger
complexity class can be subsumed. One possibility is that all PSPACE queries
are maintainable in $\mathcal{SQL}$. While we showed that there is a PSPACE-complete
problem in $\mathrm{IES}(\mathcal{SQL})$, this does not mean that all PSPACE queries are
maintainable, as IES in general is not closed under the usual reductions (polynomial
or first-order), and we do not yet know of any problem complete for PSPACE
under stronger reductions, defined in [21], that would belong to $\mathrm{IES}(\mathcal{SQL})$.

The proof of PHIER $\subseteq \mathrm{IES}(\mathcal{SQL})$ does not lend itself to an efficient algorithm
for queries of lower complexity. In fact, it is not clear if such algorithms exist 
in general, and proving, or disproving their existence, is closely tied to deep 
unresolved problems in complexity. However, coding the maintenance algorithms 
for some useful queries (e.g., the transitive closure) in SQL is quite easy [6] 
and in fact the maintenance is quite efficient for graphs of special form [22]. 
Thus, while general results in this area are probably beyond reach, one could 
consider restrictions on classes of inputs that would lead to efficient maintenance 
algorithms.

Abstract. We present a new optimization method for nested SQL query
blocks with aggregation operators. The method is derived from the
theory of dependency implication and tableau minimization. It unifies and
generalizes previously proposed (seemingly unrelated) algorithms, and
can incorporate general database dependencies given in the database
schema.

We apply our method to query blocks with MAX, MIN aggregation op- 
erators. We obtain an algorithm which does not infer arithmetical or 
aggregation constraints, and reduces optimization of such query blocks 
to the well-studied problem of tableau minimization. We prove a com- 
pleteness result for this algorithm: if two MAX, MIN blocks can be merged, 
the algorithm will detect this fact. 



1 Introduction 

The practical importance of optimizing queries in relational database systems 
has been recognized. Traditional systems optimize a given query by choosing 
among a set of execution plans, which include the possible orders of joins, the 
available join algorithms, and the data access methods that are used [SAC+79,
JK84]. Such optimizers work well for the basic SELECT-FROM-WHERE queries of
SQL [MS93]. However, they can perform poorly on nested SQL queries, which 
may include subqueries and views. 

Since nesting of queries is a salient feature of the SQL language as used 
in practice, optimization of such queries was considered early on. One line of 
research has concentrated on extending the traditional “selection propagation” 
techniques to nested queries. In these approaches, traditional optimizers are 
enhanced with additional execution plans, where selection and join predicates are 
applied as early as possible [MFPR90a, MFPR90b, MPR90, LMS94]. Another 
line of work has proceeded in an orthogonal direction, introducing execution 
plans which correspond to alternative structures of nesting. In particular, these 
approaches consider the possibilities of merging query blocks, denesting queries, 
commuting aggregation blocks with joins, and commuting GROUP BY with join 
[Day87, GW87, Kim82, Mur92, PHH92, YL94, HG94]. 

In this paper we propose an approach which unifies and generalizes the ap- 
proaches mentioned above. We apply the “selection propagation” idea to certain 
data dependencies that are implicit in aggregation blocks. Propagation of SQL 
predicates [MFPR90a, MFPR90b, MPR90, LMS94] is a special case of
propagation of these dependencies. At the same time, propagating these dependencies
can produce execution plans with alternative nesting structure, as in [Day87,
GW87, Kim82, Mur92, PHH92, YL94, HG94].

In addition to expressing in a common framework previously proposed query
transformations which seemed unrelated, our approach incorporates naturally
general data dependencies that may be given in the database schema. It extends
transformations which commute joins with aggregation operators (or GROUP BY)
and merge query blocks [Day87, PHH92, YL94, HG94], in that it does not require
adding tuple ids to the grouping attributes; and it can handle joins on
aggregation attributes as well as on grouping attributes. Also, transformations which
denest subqueries [GW87, Kim82, Mur92] only consider query blocks nested
within each other, whereas our method does not depend on the order of nesting.

We illustrate our method by means of a small example. We consider the
following database schema (of a hypothetical university database)¹:

ids(Name, Idnum)
enrolled(Name, Idnum, Course)
timetable(Course, Hours)

The relation ids records the id numbers of students. The relation enrolled 
records the courses a student is enrolled in (and his/her id number); timetable 
records the number of hours a course is taught per week. These base relations 
do not contain duplicates. 

The following dependencies are given in the database schema: 

1. enrolled.Name, Idnum ⊆ ids.Name, Idnum

2. ids: Name → Idnum

The first is an inclusion dependency (IND) stating that every pair consisting
of a student name and id number that appears in the enrolled relation also
appears in the ids relation. The second is a functional dependency (FD) stating
that student name is a key of the ids relation.

In Figure 1 we show a SQL definition for a view maxhours and a nested query Q.

The view maxhours gives, for each student, his id number and the maximum
number of hours of teaching (per week) of any of the courses he is enrolled in.
The view maxhours is used to define the nested query Q, which gives, for each
student, his id number and the maximum number of hours of teaching (per
week) of any of the courses he is enrolled in, provided that there exist at least
two courses which are taught for at least that many hours per week.

In Figure 2 we show the result of applying our optimization method to the 
nested query Q. 

The query Q’ results by transforming Q in the following ways. 

First, the join with the ids relation in the main block of Q is eliminated. This
simplification is arrived at using the aggregation block of the view maxhours,
and the dependencies of the schema. To justify the simplification we reason
informally as follows. The join with the enrolled relation in maxhours is more
restrictive than the join with the ids relation in the main block of Q, because
of the IND 1. Also, the FD Name → Idnum can be seen to hold for the enrolled
relation, because of the FD 2 and the IND 1; consequently, the value of the
Idnum attribute of Q can be taken from the Idnum attribute of the enrolled
relation, and thus from the Idnum attribute of maxhours (instead of the Idnum
attribute of the ids relation).

¹ We use a non-normalized schema for brevity; the optimization method works for
normalized and for non-normalized schemas.
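The elimination of the join with ids can be checked on sample data. The Python sketch below uses the standard `sqlite3` module with invented rows that satisfy IND 1 and FD 2, and compares Q with the variant that drops the ids join. It is an illustration only: the data is made up, the view is written with column aliases, and the comparisons in Q are read as "at least" (`<=` and `>=`), following the prose description of Q, which is an assumption about the intended operators.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ids(Name TEXT PRIMARY KEY, Idnum INTEGER);  -- FD 2: Name is a key
CREATE TABLE enrolled(Name TEXT, Idnum INTEGER, Course TEXT);
CREATE TABLE timetable(Course TEXT, Hours INTEGER);

INSERT INTO ids VALUES ('ann', 1), ('bob', 2), ('cy', 3), ('dan', 4);
-- Every (Name, Idnum) pair below also occurs in ids, so IND 1 holds.
INSERT INTO enrolled VALUES
  ('ann', 1, 'db'), ('ann', 1, 'ai'), ('bob', 2, 'os'), ('dan', 4, 'th');
INSERT INTO timetable VALUES
  ('db', 4), ('ai', 6), ('os', 6), ('th', 8), ('pl', 3);

CREATE VIEW maxhours AS
  SELECT e.Name AS Name, e.Idnum AS Idnum, MAX(t.Hours) AS Hours
  FROM enrolled e, timetable t
  WHERE e.Course = t.Course
  GROUP BY e.Name, e.Idnum;
""")

# Q as in Figure 1, with the join on ids.
with_join = conn.execute("""
  SELECT i.Name, i.Idnum, m.Hours
  FROM ids i, maxhours m
  WHERE m.Name = i.Name AND
        2 <= (SELECT COUNT(u.Course) FROM timetable u WHERE u.Hours >= m.Hours)
  ORDER BY i.Name
""").fetchall()

# The same block after eliminating the join: Idnum is taken from maxhours.
without_join = conn.execute("""
  SELECT m.Name, m.Idnum, m.Hours
  FROM maxhours m
  WHERE 2 <= (SELECT COUNT(u.Course) FROM timetable u WHERE u.Hours >= m.Hours)
  ORDER BY m.Name
""").fetchall()

assert with_join == without_join
```

If a row violating IND 1 were added to enrolled (a student name missing from ids), the two results could differ, which is why the dependencies are needed to justify the rewriting.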

The second optimization of Q is that the subquery in the WHERE clause has 
been replaced by a view countcourses, which gives, for each number of hours 
some course is taught for, the number of courses that are taught for at least 
that many hours per week. Note that the common nested iteration method of 
evaluating the subquery in Q requires retrieving the timetable relation once 
for each tuple of the view maxhours referenced in the main block of Q. On 
the other hand, Q’ can be evaluated by single-level joins containing the join
relations explicitly; this enables the optimizer to use a method such as merge
join [SAC+79] to implement the joins, often at a great reduction in cost over the
nested iteration method [Kim82]².

Observe also that the view countcourses contains the joins with the enrolled 
and timetable relations, appearing in the view maxhours. Including these joins 
makes the view countcourses safe, and produces a potentially cheaper execu- 
tion plan, as it reduces the number of groups to be aggregated. 

Optimization algorithms for nested SQL queries are often described as
algebraic transformations, operating on a query graph which captures the
relevant information in the query [MFPR90a, MFPR90b, MPR90, LMS94, Day87,
GW87, Kim82, Mur92, PHH92, YL94, HG94]. In our method, we use the
alternative tableau formalism that has been introduced in the context of conjunctive
queries [AHV95, Ull89]. In Section 2 we sketch how this formalism is used to
describe nested SQL queries.

In Section 3 we describe our optimization method; we use the chase
procedure and the concept of tableau equivalence, which have been introduced for
optimizing conjunctive queries in the presence of general data dependencies. One
important difference between SQL queries and conjunctive queries is the presence
of duplicates in the result of a typical SQL query [IR95, GV93]. Our method
correctly optimizes SQL queries where the number of duplicates is part of the
semantics, and should not be altered by optimization³.

We also describe in Section 3 how to fine-tune our method for the case of
SQL queries with MAX, MIN operators. We obtain in this case an optimization
algorithm which does not infer any arithmetical or aggregation constraints.

In Section 4 we focus on the special case of merging of SQL query blocks
with MAX, MIN operators. We show that, if such merging is possible, it will
be discovered by our optimization method. Such completeness results cannot
hold for algebraic transformations of SQL queries: designing complete systems
of algebraic transformations requires rather technical devices, having to do with
the equality predicate [YP82, IL84, C87].

² Detailed cost models illustrating the gain in complexity can be found in [Kim82,
GW87].

³ The number of duplicates is irrelevant to the semantics of our example query Q.

In Section 5 we summarize our contributions, and point out some directions 
for further research. 



ids(Name, Idnum) 
enrolled(Name, Idnum, Course) 
timetable(Course, Hours) 

1. enrolled.Name, Idnum ⊆ ids.Name, Idnum

2. ids: Name → Idnum

V: CREATE VIEW maxhours(Name, Idnum, Hours) AS
   SELECT e.Name, e.Idnum, MAX(t.Hours)
   FROM enrolled e, timetable t
   WHERE e.Course = t.Course
   GROUP BY e.Name, e.Idnum

Q: SELECT i.Name, i.Idnum, m.Hours
   FROM ids i, maxhours m
   WHERE m.Name = i.Name AND
         2 <= ( SELECT COUNT(u.Course)
                FROM timetable u
                WHERE u.Hours >= m.Hours )



Fig. 1. Example database schema and query 



Q’: SELECT m.Name, m.Idnum, m.Hours
    FROM maxhours m, countcourses k
    WHERE m.Hours = k.Hours AND
          2 <= k.Count

W: CREATE VIEW countcourses(Hours, Count) AS
   SELECT t.Hours, COUNT(u.Course)
   FROM enrolled e, timetable t, timetable u
   WHERE e.Course = t.Course AND
         u.Hours >= t.Hours
   GROUP BY t.Hours



Fig. 2. Optimized example query 




Optimization of Nested SQL Queries by Tableau Equivalence 35 



2 SQL Queries as Tableaux 

Tableaux are a declarative formalism which captures the SELECT-PROJECT-JOIN
queries of the relational calculus [AHV95, Ull89]. It is well-known that tableaux
can express the basic SELECT-FROM-WHERE queries of SQL. In this section we
describe (by example) a natural extension of tableaux which expresses SQL
queries with nested blocks and aggregation operators. The tableaux we describe
in this Extended Abstract express existential SQL queries, i.e., conditions which
have to hold for some tuples in the database. Queries with universal conditions
(OUTER JOIN, null values, ALL quantifiers) are not expressible.

For each query block we construct one tableau; subqueries or views within a 
query become separate tableaux. Figure 3 shows the tableaux for our example 
query in Figure 1. 

A typical row of a tableau has the form R(x, y, ...), where R is the name of a 
base relation, a SQL predicate or a query block; and x, y, ... are variables local 
to the tableau, or constants. 

The first row of a tableau gives the general form of a tuple in the result of 
the corresponding query block; it is called the summary row, and the variables 
it contains are called distinguished. 

The subsequent rows of the tableau give the general form of the tuples that 
have to be present in the base relations, and in the results of other query blocks; 
they typically contain additional variables, called nondistinguished. 

Thus, for the tableau corresponding to the view maxhours the summary row 
is maxhours (n, p, hmax). The tuple (n, p, hmax) will be in the result of maxhours 
just in case the relation enrolled contains some tuple (n, p, c); and the relation 
timetable contains some tuple (c, h). Notice that c, h are nondistinguished 
variables. The last line of the tableau expresses aggregation and grouping: it 
states that, for each fixed n and p, hmax is the maximum possible value of h. A 
similar formulation of aggregation is described in [Klug82]. 

A tableau corresponding to a subquery contains special non-local variables:
they are local to the tableau obtained from the enclosing query block. Thus,
the tableau corresponding to the subquery in Q, Qsubquery, contains a non-local
variable H, which is local to the tableau corresponding to Q.

It is straightforward (but lengthy) to give an algorithm which will convert a
SQL query to a tableau representation, and vice versa.
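The reading of the maxhours tableau given above can be made concrete with a small sketch. The Python below evaluates that one tableau by brute force: the rows enrolled(n, p, c) and timetable(c, h) are joined on the nondistinguished variable c, and the last line of the tableau becomes a maximum per (n, p). The function name eval_maxhours and the sample tuples are illustrative assumptions, not part of the paper.

```python
from collections import defaultdict

def eval_maxhours(enrolled, timetable):
    """Evaluate the maxhours tableau by brute force.

    Rows of the tableau: enrolled(n, p, c) and timetable(c, h).
    Summary row: maxhours(n, p, hmax), where hmax is the maximum h
    for each fixed pair (n, p).
    """
    candidates = defaultdict(list)
    for (n, p, c) in enrolled:
        for (c2, h) in timetable:
            if c == c2:  # the shared nondistinguished variable c
                candidates[(n, p)].append(h)
    # Aggregation and grouping: one summary tuple per (n, p).
    return {(n, p, max(hs)) for (n, p), hs in candidates.items()}

# Hypothetical sample database.
enrolled = {("ann", 1, "db"), ("ann", 1, "os"), ("bob", 2, "db")}
timetable = {("db", 3), ("os", 5)}
result = eval_maxhours(enrolled, timetable)
```

A student with no matching timetable tuple contributes nothing to the result, which reflects the existential reading of tableaux described in this section.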

3 The Optimization Method 

Optimization of tableaux (corresponding to conjunctive queries) has been studied
extensively. The central notion is equivalence, i.e., finding a tableau which
expresses the same query and can be evaluated more efficiently. The chase procedure
is a general method to test equivalence of tableaux, in the presence of
data dependencies [AHV95, Ull89].

Our method introduces, for each query tableau, an embedded implicational
dependency (EID) [AHV95, F82] stating that certain tuples exist and certain
predicates hold in the database. In general, we can obtain such an EID by simply
replicating the tableau.

Fig. 3. Tableaux for example query

Each query tableau is subsequently optimized using the dependencies of the 
schema and the EIDs introduced. The algorithm executes two passes (as in 
[LMS94]): 

The first pass proceeds in a bottom-up way. Each tableau is optimized using
the EIDs of the tableaux it contains. We start from the tableaux which contain
no subqueries or views, and finish with the top-level tableau.

In the second pass, each tableau is optimized using the EIDs of the tableaux
it is contained in, in a top-down way.
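The order in which the two passes visit the tableaux can be sketched as a pair of traversals. The sketch below assumes the nesting of query blocks forms a tree, and the containment map for the running example is a hypothetical guess at that structure; it only computes visiting orders and performs no optimization.

```python
def two_pass_order(contains, root):
    """Return (bottom_up, top_down) visiting orders for the tableaux.

    `contains` maps a tableau name to the subquery/view tableaux it
    contains (assumed to form a tree rooted at `root`).
    """
    bottom_up = []

    def visit(tableau):
        for child in contains.get(tableau, []):
            visit(child)
        bottom_up.append(tableau)  # a tableau comes after all it contains

    visit(root)
    # Reversing a post-order puts every tableau before the ones it contains.
    top_down = list(reversed(bottom_up))
    return bottom_up, top_down

# Hypothetical nesting for the running example.
contains = {"Q": ["maxhours", "Qsubquery"]}
first_pass, second_pass = two_pass_order(contains, "Q")
```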

In each pass, the optimization of each tableau consists of two distinct oper- 
ations: 

The first operation is to introduce new predicates, and to simplify the joins
by eliminating rows of the tableau.

The second operation is to replace subqueries by views (cf. the Introduction); 
it is done only during the second pass. 

We illustrate the two operations by means of our running example. 

Figure 4 shows the EID obtained from the view maxhours. It states that, for
each tuple (n, p, hmax) in the result of maxhours, the relation enrolled contains
a tuple (n, p, c); and the relation timetable contains a tuple (c, hmax), for some
c. Notice that the EID is simpler than the tableau of maxhours. Such simplified
EIDs can be used for query blocks with the MAX, MIN aggregation operators.

Introduction of new predicates and simplification of joins are done as follows.

The tableau is chased with the appropriate EIDs, and the dependencies of
the schema. Figure 5 shows (in part) the result of applying this procedure to the
tableau for Q. New rows are added to the tableau; they appear after the triple
line. Chasing the second row of the original tableau with the EID obtained from
maxhours adds the first two of the new rows. Chasing the first of the new rows
with the IND 1 of the schema adds the third new row.

The chase also adds to the tableau the SQL predicates appearing in the 
EIDs. In the case of the equality predicate, variables in the tableau are equated. 
In Figure 5, such equating happens by applying the FD 2 of the schema to the
first and last rows; this equates p’ with p.

To simplify the joins, the tableau resulting from the chase is minimized. This 
is done by examining the rows of the original tableau not used in the chase, and
eliminating those which are covered by the tuples introduced by the chase. 

Remark. It is not necessary for the chase itself to terminate; the tableau can 
still be minimized, as soon as a row as above is discovered. 

Thus, the first row of the tableau in Figure 5 can be eliminated, because it
is duplicated in the last row (recall that p’ has been equated with p). 

The final optimized tableau is obtained by dropping the rows that were introduced
by the chase. In our example, this gives the tableau for Q’ in Figure 6.

Remark. If the number of duplicates is part of the semantics of a query block, 
minimization of the corresponding tableau is omitted. 

Replacement of subqueries by views is done as follows. 

The non-local variables of the tableau corresponding to a subquery are traced
to the tableaux they are local to. The tuples containing those variables as local
are added to the subquery tableau. The resulting tableau is optimized using
the chase, as in the first operation.

Applying this operation to the tableau Qsubquery in Figure 3 (where H is a
non-local variable) results in the tableau countcourses in Figure 6.

The correctness of our method is expressed in the following result. 



Theorem 1. Suppose a query Q’ is obtained by optimizing a query Q.

(i) On every database, the result of Q’ contains exactly the same tuples as
the result of Q.

(ii) If minimization is not used, each tuple is duplicated in the result of Q’
the same number of times as in the result of Q.

(iii) If minimization is used, each tuple is duplicated in the result of Q’ at
most as many times as in the result of Q.



The argument is a straightforward application of the properties of tableau
chase and minimization, and of the results of [IR95, CV93].

Fig. 4. EID from the view maxhours

Fig. 5. Chase on the tableau of Q

Fig. 6. Tableaux for optimized example query



4 Completeness for Merging MAX, MIN Aggregation Blocks

It is not hard to see that nested SQL query blocks without aggregation can 
be merged. This is the Type-N and Type-J nesting considered in [Kim82]. In 
addition, our optimization method can merge query blocks where MAX, MIN
operators are used in the inner block. 

An example of such merging is shown in Figure 7; our running example is
varied by omitting the last conjunct of the WHERE clause of Q, to obtain Q0. The
optimized block is Q’0: essentially, Q0 has been merged with the view maxhours.

There are cases where merging of MAX (MIN) query blocks can be shown to
be impossible. Consider again the query Q in our example. It is not hard to
see that, by adding appropriately chosen tuples to the base relations, we can
change the result of Q to empty: consider the semantics of the last conjunct of
the WHERE clause of Q. In contrast, this cannot happen for Q’0, or its equivalent
Q0.

Definition 2. A query is simple if its result cannot be changed to empty by
adding tuples to the database relations.

Proposition 3. A SQL query defined by a single MAX block is simple.

An analogous Proposition holds for SQL queries defined by a single MIN 
block. 

By the above remarks, SQL query blocks cannot be merged into a single MAX 
block, unless the query defined is simple. 

We can now state our completeness result. 

Theorem 4. If a SQL query is simple, the optimization method transforms it 
into a single MAX block. 

The proof uses the properties of the chase to construct a database which 
demonstrates that the query is not simple (if the optimization method cannot 
transform the query into a single MAX block). 

An analogous result holds for transforming SQL queries into a single MIN 
block. 



Q0:  SELECT i.Name, i.Idnum, m.Hours
     FROM ids i, maxhours m
     WHERE m.Name = i.Name

Q’0: SELECT m.Name, m.Idnum, m.Hours
     FROM maxhours m


Fig. 7. Example of merging aggregation blocks 



5 Conclusions 

We have presented a general optimization method for nested SQL queries, which 
unifies several known approaches and at the same time extends them in several 
nontrivial ways. We have applied our method to the case of query blocks with 
MAX, MIN aggregation operators. For such queries, we have obtained an algorithm
which avoids the complications of inferring arithmetical or aggregation
constraints [SRSS94, NSS98]; thus, it becomes possible to use algorithms for optimizing
queries without constraints [DBS90, CR97, ASU79a, ASU79b, JKlug84,
CM77] to optimize nested SQL query blocks with MAX, MIN.

We believe our approach will be fruitfully applicable in other cases. A natural 
proposal is to apply it to aggregation operators which are known to be delicate 
to analyze, such as COUNT [Kim82, GW87, Mur92]. 

Finally, it should be possible to extend our approach to incorporate other
optimization algorithms [RR98, S.et.al.96, SPL96] within our general framework.

Abstract. User-defined aggregates (UDAs) can be the linchpin of sophisticated
data mining functions and other advanced database applications,
but they find little support in current database systems. In this
paper, we describe the SQL-AG prototype that overcomes these limitations
by supporting UDAs as originally proposed in Postgres and SQL3.
Then we extend the power and flexibility of UDAs by adding (i) early
returns (to express online aggregation) and (ii) syntactically recognizable
monotonic UDAs that can be used in recursive queries to support
applications, such as Bill of Materials (BoM) and greedy algorithms for
graph optimization, that cannot be expressed under stratified aggregation.
This paper proposes a unified solution to both the theoretical and
practical problems of UDAs, and demonstrates the power of UDAs in
dealing with advanced database applications.



1 Introduction 

The importance of new specialized aggregates in advanced applications is ex- 
emplified by rollups and data cubes that, owing to their use in decision sup- 
port applications, have been included in all new releases of commercial DBMSs. 
Yet, we claim that database vendors, and to a certain extent even researchers, 
have overlooked User-Defined Aggregates (UDAs) , which can play an even more 
critical and pervasive role in advanced database applications, particularly data 
mining. In this paper, we show that: 

• Many data mining algorithms rely on specialized aggregates.
• The number and diversity of these aggregates imply that (rather than vendors
  adding ad hoc built-ins, which are never enough) a general mechanism
  should be provided to introduce new UDAs, in analogy to user-defined scalar
  functions of object-relational (O-R) DBMSs.
• UDAs can be easily and efficiently incorporated in O-R DBMSs, in accordance
  with the UDA specs originally proposed in SQL3 [8]. This is also true
  for the UDA extensions discussed in this paper that greatly improve their
  flexibility and functionality.

2 Aggregates in Data Mining

As a first example, consider the data mining methods used for classification. Say, 
for instance, that we want to classify the value of PlayTennis as a ‘Yes’ or a ‘No’ 
given a training set such as that shown in Table 1. 



Table 1. Tennis

The algorithm known as Boosted Bayesian Classifier [5] has proven to be the
most effective at this task (in fact, it was the winner of the KDD’97 data mining
competition). A Naive Bayesian [5] classifier makes probability-based predictions
as follows. Let $A_1, A_2, \ldots, A_k$ be attributes, with discrete values, used to predict
a discrete class $C$. (For the example at hand, we have four prediction attributes,
$k = 4$, and $C$ = 'PlayTennis'.) For attribute values $a_1$ through $a_k$, the optimal
prediction is the value $c$ for which $\Pr(C = c \mid A_1 = a_1 \wedge \ldots \wedge A_k = a_k)$ is
maximal. By Bayes’ rule, and assuming independence of the attributes, this
means to classify a new tuple to the value of $c$ that maximizes the product of
$\Pr(C = c)$ with:

$$\prod_{j=1,\ldots,k} \Pr(A_j = a_j \mid C = c)$$

But these probabilities can be estimated from the training set as follows: 



$$\Pr(A_j = a_j \mid C = c) = \frac{count(A_j = a_j \wedge C = c)}{count(C = c)}$$



The numerators and the denominators above can be easily computed using
SQL aggregate queries. For instance, all the numerator values for the third
column (the Wind column) can be computed as follows:

Example 1. Using SQL’s count Aggregate

SELECT Wind, PlayTennis, count(*)
FROM Tennis
GROUP BY Wind, PlayTennis

Furthermore, the Super Groups construct contained in the recent OLAP exten- 
sions of commercial SQL systems [3] allows us to express this computation in a 
single query: 

Example 2. Using DB2’s grouping sets

SELECT Outlook, Temp, Humidity, Wind,
       PlayTennis, count(*)
FROM Tennis
GROUP BY GROUPING SETS ((PlayTennis),
       (Outlook, PlayTennis), (Temp, PlayTennis),
       (Humidity, PlayTennis), (Wind, PlayTennis))
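To show how the counts produced by Examples 1 and 2 feed the classifier, here is a minimal Python sketch. It gathers count(C = c) and count(A_j = a_j ∧ C = c) in one pass and then picks the class that maximizes the product given above. The four training rows are hypothetical stand-ins (the contents of Table 1 are not reproduced here), the names train and classify are illustrative, and no smoothing of zero counts is attempted.

```python
from collections import Counter

ATTRS = ("Outlook", "Temp", "Humidity", "Wind")

def train(rows):
    """Collect count(C = c) and count(A_j = a_j AND C = c) in one pass."""
    class_counts = Counter()
    joint_counts = Counter()
    for row in rows:
        c = row["PlayTennis"]
        class_counts[c] += 1
        for attr in ATTRS:
            joint_counts[(attr, row[attr], c)] += 1
    return class_counts, joint_counts

def classify(class_counts, joint_counts, new_row):
    """Return the class c maximizing Pr(C = c) * prod_j Pr(A_j = a_j | C = c)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total  # estimate of Pr(C = c)
        for attr in ATTRS:
            score *= joint_counts[(attr, new_row[attr], c)] / n_c
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training rows.
def make_row(outlook, temp, humidity, wind, play):
    return {"Outlook": outlook, "Temp": temp, "Humidity": humidity,
            "Wind": wind, "PlayTennis": play}

rows = [
    make_row("Sunny", "Hot", "High", "Weak", "No"),
    make_row("Sunny", "Hot", "High", "Strong", "No"),
    make_row("Overcast", "Hot", "High", "Weak", "Yes"),
    make_row("Rain", "Mild", "Normal", "Weak", "Yes"),
]
class_counts, joint_counts = train(rows)
prediction = classify(class_counts, joint_counts,
                      make_row("Sunny", "Hot", "High", "Weak", None))
```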



In conclusion, this award-winning classification algorithm can be imple- 
mented well using the SQL count aggregate, thanks to the multiple grouping 
extensions recently introduced to support OLAPs. A database-centric approach 
to data mining is often preferable to main-memory oriented implementations, 
because it ensures better scalability and performance on large training sets. Un- 
fortunately, unlike the Bayesian classifier just discussed, most data mining func- 
tions are prohibitively complex and inefficient to express and execute using the 
(SQL-compliant) data manipulation primitives of current database systems [23]. 
In this paper, we claim that the simplest and most cost-effective solution to this 
problem consists in adding powerful UDA capabilities to DBMSs. Toward this 
goal, we implemented the UDA specifications originally proposed for SQL3 [8]
(but not yet supported in commercial systems) and extended them with the
mechanism of early returns discussed in the next section. While we use mostly
data mining examples, UDAs are needed in many applications to overcome the 
limited expressive power of SQL; for instance, we found them essential in imple- 
menting temporal database queries [4]. 



3 UDAs and Early Returns 

While the aggregate computations needed in a Bayesian classifier can be ex- 
pressed using SQL built-ins, this is not the case for most data mining algorithms. 
For instance the SPRINT classifier [24] chooses on which attribute and value to 
split next using a gini index: 



$$gini(S) = 1 - \sum_{j=1}^{C} p_j^2 \qquad (1)$$

Here $p_j$ denotes the relative frequency of class $j$ in the training set $S$. For discrete
domains (i.e., categorical attributes) this operation can be implemented using the 
standard count aggregate of SQL. However, the attribute values from continuous 
domains must be first sorted on the attribute value, and then the count must be 
evaluated incrementally for each new value in the sorted set. Now, incremental 
evaluation of aggregates is not fully supported in current DBMSs (even those 
providing support for rollups). Moreover, the objective of the gini computation is
to select a point (and a column from the table) where the gini index is minimum. 
Thus, for each new value in the sorted set, (i) the running count for each class 
must be updated, and (ii) the value of the gini function at this point must be 
calculated and compared with the minimum so far, to see if the old value must 
be replaced with the new one; in fact, after every value has been examined, 
(iii) the minimum point for the gini must be returned, since this point will be 
used for the next split. Therefore, the gini computation involves the following
aggregate-like operations: (i) computing a running count, (ii) composing two 
aggregates (via the intermediate gini function), and (iii) returning the point 
where the minimum is found (rather than the value of that minimum). None 
of these three operations can be easily expressed and efficiently supported in 
SQL2; but with UDAs originally proposed for SQL3 [8], they can be merged 
into a single and efficient computation that determines the splitting point in a 
single pass through the dataset. 
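The three steps just listed can be sketched in Python as a single scan over the sorted values. This is only an illustration of the control flow, not the SQL-AG implementation: it assumes the score of a candidate split is the size-weighted gini index of the two partitions (a common formulation for SPRINT-style classifiers, and an assumption here), and the names gini and best_split are hypothetical.

```python
from collections import Counter

def gini(counts, n):
    """gini(S) = 1 - sum_j p_j^2, with p_j = counts[j] / n."""
    if n == 0:
        return 0.0
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(records):
    """records: iterable of (value, class_label) pairs.

    Returns (split_value, score): records with value <= split_value go
    to one side. Ties are resolved in favour of the first minimum found.
    """
    data = sorted(records)              # sort on the attribute value
    total = Counter(label for _, label in data)
    n = len(data)
    below = Counter()                   # running count per class
    best_point, best_score = None, float("inf")
    for i, (value, label) in enumerate(data, start=1):
        below[label] += 1               # (i) update the running count
        if i < n and data[i][0] == value:
            continue                    # split only between distinct values
        if i == n:
            break                       # nothing left above this point
        above = total - below
        score = (i / n) * gini(below, i) + ((n - i) / n) * gini(above, n - i)
        if score < best_score:          # (ii) compare with the minimum so far
            best_point, best_score = value, score
    return best_point, best_score       # (iii) return the point of the minimum

split = best_split([(1, "a"), (2, "a"), (3, "b"), (4, "b")])
```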

While UDAs such as those proposed for SQL3 [8] are the right tool for com- 
puting a gini index, they cannot express many other aggregate computations, 
and, in particular, they cannot express online aggregation [6]. Online aggregation
is very useful in many situations, e.g., to stop as soon as the computation
of an average converges within the desired accuracy, or when aggregates, such
as count or sum, have crossed the minimum support level (e.g., in the A Priori
algorithm). Online aggregates find many applications in data mining [26], and
greatly extend the power of UDAs.

We can solve these problems by allowing UDAs to produce “early returns”, 
i.e., to return values during the computation, rather than only at the end of the 
computation as in traditional aggregates. The computation of rollups, running 
aggregates, moving window aggregates, and many others becomes simple and 
efficient using the mechanism of early returns, which allows the generation of 
partial results while the computation of the aggregate is still in progress [4]. 

For instance, while final returns can be used to find a point of global minimum 
for a function, such as the gini function, early returns will be used to compute 
the points where local extrema occur (i.e., the valleys and the peaks). 



4 Extended UDA and SQL3 

In this section, we discuss the SQL-AG language, whereas the SQL-AG system
is described in the next section. To introduce a UDA named myavg, according
to the specifications proposed for SQL3 [8], we must proceed as shown in Example 3.
Basically, the user must define three user-defined functions (UDFs) for
the three cases INITIALIZE, ITERATE, and TERMINATE. The INITIALIZE (ITERATE)
function defines how the first (successive) values in the set are processed.
The TERMINATE function describes the final computation for the aggregate
value. Thus, to compute the traditional average, the state will hold the variables
sum and count; these are, respectively, initialized to the first value of the set
and to 1 by myavg_single. Then, for each successive value in the set, myavg_multi
adds this value to sum and also increases count by 1. Finally, myavg_terminate
returns sum/count.

Example 3. A UDA Definition

AGGREGATE FUNCTION myavg(IN NUMBER)
RETURNS NUMBER
STATE state
INITIALIZE myavg_single
ITERATE myavg_multi
TERMINATE myavg_terminate
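The INITIALIZE / ITERATE / TERMINATE structure of Example 3 can be tried out with any engine that accepts user-defined aggregates. The sketch below uses SQLite through Python's sqlite3 module rather than SQL-AG or DB2, so it is an analogy and not the paper's system: SQLite has a single step method, and the first call plays the role of INITIALIZE. The emp table and its rows are hypothetical.

```python
import sqlite3

class MyAvg:
    """State is (sum, count), mirroring INITIALIZE / ITERATE / TERMINATE."""

    def __init__(self):
        self.state = None

    def step(self, value):
        if self.state is None:
            self.state = (value, 1)              # INITIALIZE (myavg_single)
        else:
            total, count = self.state
            self.state = (total + value, count + 1)  # ITERATE (myavg_multi)

    def finalize(self):
        if self.state is None:
            return None                          # empty group
        total, count = self.state
        return total / count                     # TERMINATE (myavg_terminate)

conn = sqlite3.connect(":memory:")
conn.create_aggregate("myavg", 1, MyAvg)
conn.execute("CREATE TABLE emp (dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("a", 10.0), ("a", 20.0), ("b", 5.0)])
rows = conn.execute(
    "SELECT dept, myavg(salary) FROM emp GROUP BY dept ORDER BY dept"
).fetchall()
```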

The search for global minima for the gini index can be easily programmed
using two UDFs gini_single and gini_multi. But, in the presence of ties, the
gini_terminate function will return any of the points where the global minimum occurs,
e.g., the first point. Therefore, the order in which the elements of a set are
considered becomes important, and can influence the final result, and to the
extent that this order is unknown, UDAs display a nondeterministic behavior.
Traditional SQL built-ins are instead deterministic, i.e., they always return the
same result on a given set. This nondeterministic behavior is not an impediment
in formalizing the logic-based semantics of UDAs, and in writing effective queries;
in fact, nondeterminism is a critical feature in many real life applications.

An important extension introduced by SQL-AG is early returns, which are
specified using a PRODUCE myavg_produce function. For instance, with an online
aggregation, the average of values computed so far can be returned every N 
records, where N is specified by a user or computed by a function that evaluates 
the rate of convergence. Early returns are useful in many other roles, besides 
online aggregation. For instance, in a time series we need to find local extrema, 
i.e., valleys and peaks, which are easily handled with early returns. In this case, 
the aggregate might not produce any final return, and this can be specified by 
TERMINATE NOP. 
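Early returns map naturally onto generators, which hand back values while the scan is still in progress. The following Python sketch illustrates the two uses mentioned above: an online average produced every N records, and local extrema of a time series. Neither generator returns a final value, which corresponds to TERMINATE NOP. The function names and the fixed reporting interval are assumptions made for the illustration.

```python
def online_avg(values, every):
    """Yield the average so far after every `every` records (PRODUCE).

    Nothing is returned at the end of the scan (TERMINATE NOP).
    """
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
        if count % every == 0:
            yield total / count

def local_extrema(series):
    """Yield ('peak' | 'valley', value) for each strict local extremum.

    `series` must be a list, since neighbouring elements are compared.
    """
    for prev, cur, nxt in zip(series, series[1:], series[2:]):
        if cur > prev and cur > nxt:
            yield ("peak", cur)
        elif cur < prev and cur < nxt:
            yield ("valley", cur)

partial_averages = list(online_avg([1, 2, 3, 4, 5, 6], every=2))
extrema = list(local_extrema([1, 3, 2, 0, 4]))
```

A consumer can stop pulling from online_avg as soon as successive averages agree to the desired accuracy, which is the behavior online aggregation is meant to provide.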

An important issue brought to a resolution by early returns is that of mono- 
tonicity: in the next section we prove that aggregates with only early returns 
(i.e., those declared with TERMINATE NOP) are monotonic and can be freely 
used in recursion. This provides a surprisingly simple solution to the problem 
of detecting monotone aggregation [15] that had remained open since Ross and 
Sagiv demonstrated the many useful applications of these aggregates [19]. Mono- 
tonic aggregates can be used to express graph traversal algorithms, greedy algo- 
rithms, Bill of Materials (BoM) applications and other computations that were 
previously viewed to be beyond the capabilities of SQL and Datalog [19,9,17,10]. 




Haixun Wang and Carlo Zaniolo 



For example, say that we have defined an mcount aggregate, where PRODUCE
returns a new partial count for each new element in the set, and thus there
is no final return. Therefore, mcount is a monotonic aggregate: for a set with
cardinality 4, mcount will simply produce 1, 2, 3, 4; when a new element is added
to the set mcount returns 1, 2, 3, 4, 5. Thus mcount is monotonic with respect to
set containment, whereas the traditional count returns first {4} and then {5},
where the latter set is not a superset of the former. (Observe that mcount is
monotonic and deterministic; the msum aggregate returning the sum so far is
still monotonic, but nondeterministic.)
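The containment argument can be checked directly. In this small Python sketch, each aggregate is modeled as the set of values it returns for a given input set; the function names mirror the text, and the inputs are arbitrary.

```python
def mcount(items):
    """Set of partial counts produced by early returns: {1, ..., n}."""
    return {i for i, _ in enumerate(items, start=1)}

def count(items):
    """The traditional count returns a single final value."""
    return {len(items)}

small = ["w", "x", "y", "z"]
large = small + ["v"]

mcount_is_monotonic = mcount(small) <= mcount(large)   # subset test
count_is_monotonic = count(small) <= count(large)
```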

Consider now the use of monotonic aggregates in solving recursive problems. 
The Join-the-Party problem states that some people will come to the party 
no matter what, and their names are stored in a sure(Person) relation. But 
others will join if at least three of their friends will be there. Here, friend(P, F)
denotes that P regards F as a friend. A monotonic user-defined aggregate mcount 
is used inside a recursive query to solve this problem. The PRODUCE routine of 
mcount returns the intermediary count and its TERMINATE routine is defined 
as NOP. 

Example 4. Join the Party in SQL-AG

WITH RECURSIVE willcome(Name) AS
( SELECT Person FROM sure
  UNION ALL
  SELECT f.P
  FROM willcome, friend f
  WHERE willcome.Name = f.F
  GROUP BY f.P
  HAVING mcount(f.F) = 3
) SELECT Name FROM willcome
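The intended result of Example 4 can be reproduced with a naive fixpoint loop. The Python sketch below is not how SQL-AG evaluates the query; it simply iterates until no new person can be added, treating mcount(f.F) = 3 as "the running count of attending friends reaches three". The function name and sample relations are hypothetical.

```python
def join_the_party(sure, friend, threshold=3):
    """Compute who will come.

    sure:   set of people who come no matter what.
    friend: set of (P, F) pairs, meaning P regards F as a friend.
    """
    willcome = set(sure)
    changed = True
    while changed:
        changed = False
        attending_friends = {}
        for p, f in friend:
            if f in willcome:
                attending_friends.setdefault(p, set()).add(f)
        for p, fs in attending_friends.items():
            # The running count passes through `threshold` once it is reached.
            if len(fs) >= threshold and p not in willcome:
                willcome.add(p)
                changed = True
    return willcome

sure = {"a", "b", "c"}
friend = {("d", "a"), ("d", "b"), ("d", "c"),
          ("e", "a"), ("e", "b"), ("e", "d"),
          ("f", "a")}
guests = join_the_party(sure, friend)
```

Here e joins only after d has joined, which is why the computation has to be recursive.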

As we shall see later, this program has a formal logic-based semantics, inas- 
much as it can be translated into an equivalent Datalog program that has stable 
model semantics [11]. On a more practical note, a host of advanced database 
applications, particularly data mining applications, benefit from our UDAs. For 
instance, it is possible to express complex algorithms such as the ‘A Priori’ 
algorithm using the monotonic version of count, resulting in more flexibility 
and opportunities for optimization. Since the result of a fixpoint computation 
on monotonic operators is not dependent on the particular order of execution, 
several variations of A Priori are possible; for instance, a technique where the 
computation of item-sets of cardinality n+1 starts before that of cardinality n is 
completed was proposed in [2]. We were also able to implement other data mining
algorithms, such as SPRINT/PUBLIC(1) and iceberg queries [7] in SQL-AG,
with very little effort. The UDAs were used here to build histograms, calculate
the gini index, and to perform in one pass the complex comparisons of tree costs
needed to implement PUBLIC(1) [18].

5 SQL-AG

Two versions of SQL-AG were implemented, the first on Oracle, using PL/SQL, 
and the second for IBM DB2. Here we describe this second version, which is sig- 
nificantly more powerful and efficient than the other. DB2 supports user-defined 
functions (UDFs) but not user-defined aggregates. The SQL-AG system sup- 
ports SQL queries with UDAs by transforming them into DB2 queries that use 
scratchpad UDFs to emulate the functionality of the corresponding UDAs [3]. 
For instance, say that we want to find the average salary of employees by de- 
partment, using the UDA myavg, instead of the SQL built-in; then we can write: 

SELECT dept, myavg(salary)
FROM emp
GROUP BY dept

This query is translated by SQL-AG into the following query, which can be 
executed by DB2: 

SELECT dept, myavg(dept)
FROM emp
WHERE myavg_groupby(dept, salary) = 0
GROUP BY dept

Here, the function myavg_groupby performs the actual computation of the
aggregate by applying to each record the INITIALIZE and ITERATE functions 
written by the user (i.e., the functions myavg_single and myavg_multi for 
Example 3), and then returning 0. Finally, for each dept the function myavg(dept) 
applies the TERMINATE function written by the user (i.e., myavg_terminate 
for Example 3) to the last values computed by myavg_groupby, returning the 
final result. Similar transformations are used to handle the case where the UDA 
only has early returns, and the more complex case where both early returns and 
final returns are used. More details about SQL-AG and its implementation can 
be found in [25]. 
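The shape of this translation can be mimicked in a few lines of Python. The sketch below only imitates the division of labor between the two functions: a dictionary stands in for the scratchpad, the list comprehension plays the role of the WHERE clause, and the final dictionary plays the role of GROUP BY. It makes no claim about DB2's actual scratchpad interface, and the emp rows are hypothetical.

```python
scratchpad = {}  # per-group aggregate state: dept -> (sum, count)

def myavg_groupby(dept, salary):
    """Apply INITIALIZE or ITERATE for the record's group, then return 0."""
    if dept not in scratchpad:
        scratchpad[dept] = (salary, 1)             # myavg_single
    else:
        total, count = scratchpad[dept]
        scratchpad[dept] = (total + salary, count + 1)  # myavg_multi
    return 0

def myavg(dept):
    """Apply TERMINATE to the last state computed for the group."""
    total, count = scratchpad[dept]
    return total / count                           # myavg_terminate

emp = [("a", 10.0), ("a", 20.0), ("b", 5.0)]

# WHERE myavg_groupby(dept, salary) = 0: true for every row, run for its side effect.
selected = [row for row in emp if myavg_groupby(*row) == 0]
# GROUP BY dept, with myavg(dept) evaluated once per group.
result = {dept: myavg(dept) for dept in {d for d, _ in selected}}
```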

We compared the performance of native DB2 builtins against SQL-AG UDAs
on an Ultra SPARC 2 with 128 megabytes of memory. We used a new UDA, myavg,
which has the same functionality as the builtin aggregate avg. Figure 1 shows
that, when aggregation contains no group-by columns, our UDAs incur a modest
performance penalty with respect to DB2 builtins. However, when group-by
columns are used, then the UDAs of SQL-AG normally outperform DB2’s builtin 
aggregates, as shown in Figure 2. This is due to the fact that DB2 implements 
grouping by pre-sorting all the records, while SQL-AG uses hashing. This ad- 
vantage is lost if the group-by columns coincide with the primary key for the 
relation at hand, and thus the data is already in the proper order. In this case, 
our UDAs are somewhat slower than DB2 builtins (bottom curve in Figure 2).

Our performance comparison shows that, in general, user-defined aggregates
can be expected to have performance comparable to that of builtins.¹ In fact,
there are several situations where specialized UDAs will be preferred to builtin
aggregates simply for performance reasons. For instance, all counts needed in
Example 2 can be computed in one pass through the data using a hash-based
approach (and SQL-AG allows the user to specify whether the implementation
of each aggregate is hash-based or sort-based). In DB2, and other commercial
systems, an implementation of GROUPING SETS normally results in a cascade
of sorting operations. As illustrated by Figure 3, the one-pass hash-based approach
resulted in a substantial speed-up, and improved scalability (DB2 on our
workstation refused to handle more than 800000 records).

¹ These results were obtained using DB2 UDFs in an unfenced mode [3]. Execution
in the fenced mode was considerably slower.

Fig. 1. Aggregates without Group-by (x-axis: number of records, in millions)

6 Aggregates in Logic 

The procedural attachments used to define new aggregates in SQL-AG could 
leave the reader with the impression that these are merely procedural exten- 
sions, without the benefits of the formal logic-based semantics that provides the 
bedrock for relational query languages and the recent SQL extensions for recursive
queries. Fortunately, this is not the case, and we next provide a logic-based
formalization for UDAs. This also yields a simple syntactic characterization of 
aggregates that are monotonic in the standard lattice of set-containment, and 
can therefore be used without restrictions in recursive queries. This breakthrough 
offers a simple solution to the monotonic aggregation problem, and allows us to 
express applications such as BoM and graph traversals that had long been problematic
for SQL and Datalog [19,10,15].

Inductive Definition of Aggregates. Aggregate functions on (non-empty) sets
can be defined by induction. The base case for induction is that of singleton
sets; thus, for count, sum and max, we have $count(\{y\}) = 1$, $sum(\{y\}) = y$,
and $max(\{y\}) = y$. Then, by induction, we consider sets with two or more
elements; these sets have the following form: $S \uplus \{y\}$, where $\uplus$ denotes disjoint
union (thus $S$ is the “old” set while $y$ is the “new” element). Then, our specific
inductive functions are as follows: $sum(S \uplus \{y\}) = sum(S) + y$,
$count(S \uplus \{y\}) = count(S) + 1$, and $max(S \uplus \{y\})$ = if $y > max(S)$ then $y$ else $max(S)$. Thus,
expressing aggregates in Datalog can be broken down in two parts: (i) writing
the rules for the specific inductive functions used for this particular aggregate,
and (ii) writing the recursive rules that enumerate the elements of a set one-by-one
as needed to apply the specific inductive functions. Part (i) is described
next, and part (ii) is discussed in the next section. For concreteness, we use here
the syntax of LDL++ [20,28].

In LDL++, the base step in the computation of an aggregate is expressed
by single rules that apply to singleton sets, while the induction step is
expressed by multi rules that apply to sets with two or more elements. Thus,
we obtain the following definitions for sum

single(sum, Y, Y).
multi(sum, Y, Old, New) ← New = Old + Y.

and for max

single(max, Y, Y).
multi(max, Y, Old, Y) ← Y > Old.
multi(max, Y, Old, Old) ← Y <= Old.

Therefore, we use the first argument in the heads of the rules to hold the
unique name of each aggregate.

Then, the freturn rules are used to specify the value to be returned at the
end of the computation. For sum and max the return rules are as follows:

freturn(sum, Y, Old, Old).
freturn(max, Y, Old, Old).



The complete definition of the aggregate avg is as follows:

single(avg, Y, (Y, 1)).
multi(avg, Y, (Sum, Count), (Nsum, Ncount)) ← Nsum = Sum + Y,
                                               Ncount = Count + 1.
freturn(avg, Y, (Sum, Count), Avg) ← Avg = Sum/Count.

The LDL++ extension recently developed at UCLA also supports early returns,
which must be specified using ereturn rules. Thus, if the user wants to
see partial results from the computation of averages every 100 elements, the
following rule must be added:

ereturn(avg, X, (Sum, Count), Avg) ←
        Count mod 100 = 0, Avg = Sum/Count.
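One way to read the single, multi, freturn and ereturn rules operationally is as the pieces of a fold over the elements of the set. The Python driver below is a hedged sketch of that reading, not the LDL++ evaluation strategy: None stands in for nil, each ereturn is applied to the old state together with the new element (as in the rules above), and a reporting interval of 2 replaces the 100 of the text so that the example stays small. All function names are illustrative.

```python
def run_aggregate(values, single, multi, freturn=None, ereturn=None):
    """Fold `values` with single/multi; collect early and final returns."""
    early, state = [], None              # None plays the role of nil
    for y in values:
        if ereturn is not None:
            out = ereturn(y, state)      # sees the old state and the new element
            if out is not None:
                early.append(out)
        state = single(y) if state is None else multi(y, state)
    final = None
    if freturn is not None and state is not None:
        final = freturn(state)
    return early, final

# avg: state is (Sum, Count).
def avg_single(y):
    return (y, 1)

def avg_multi(y, old):
    return (old[0] + y, old[1] + 1)

def avg_freturn(state):
    return state[0] / state[1]

def avg_ereturn(y, old):
    # Fires when the old Count is a multiple of 2 (100 in the text).
    if old is not None and old[1] % 2 == 0:
        return old[0] / old[1]
    return None

avg_early, avg_final = run_aggregate(
    [1, 2, 3, 4], avg_single, avg_multi, avg_freturn, avg_ereturn)
```

Running the driver on four values yields one early return (the average of the first two elements, reported when the third arrives) and the usual average as the final return.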

In order to find the average salary of employees grouped by department, the 
user can thus write: 

p(DeptNo, avg(Sal)) ← empl(Ename, Sal, DeptNo).

Thus, in this syntax, that is shared by both LDL++ [28] and CORAL [17],
aggregates, such as avg(. . .), are used as arguments in the head of the rule, and
the remaining non-aggregate arguments are interpreted as group-by attributes.

The head aggregate construct can be viewed as a meta-level construct with 
first order semantics; as shown in the next section, it can be expanded into the 
internal rules that, along with the single, multi, freturn and ereturn rules written 
by the user, express the formal meaning of aggregates in logic. 

Let us now define a logic-based equivalent of the SQL-AG program of Example 4.
We begin by defining mcount that returns the incremental count at each
step:

single(mcount, Y, 1).
multi(mcount, Y, Old, New) ← New = Old + 1.
ereturn(mcount, Y, Old, New) ← Old = nil, New = 1.
ereturn(mcount, Y, Old, New) ← Old ≠ nil, New = Old + 1.

The first ereturn rule applies when Old = nil (where nil is just a special
value, not the empty list). Now, the condition Old = nil is only satisfied when
the first Y value in the set is found; thus, this rule is enabled together with
the single rule, and produces the integer 1. After that, the second ereturn rule
applies repeatedly, in parallel with the multi rule, producing 2, ..., n, where n
is the number of items counted so far.
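The behaviour of mcount can be pictured with a Python generator. This is a rough sketch under the assumption that the input values arrive one at a time in some arbitrary order; `None` stands in for nil, and the name `mcount` is reused only for readability.

```python
def mcount(values):
    """Yield the running count after each input value (early returns only)."""
    old = None                                   # plays the role of nil
    for _y in values:
        new = 1 if old is None else old + 1      # single step, then multi steps
        yield new                                # ereturn: emit the partial result
        old = new

counts = list(mcount(["e1", "e2", "e3"]))
```

Because there is no final return, every value produced is a partial count, and consumers may stop reading whenever they have seen enough.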

The query, “Find all departments with less than 7 employees” can be
expressed as follows:

count_emp(D#, mcount(E#)) <- emp(E#, Sal, D#).
large_dept(D#) <- count_emp(D#, Count), Count = 7.
small_dept(D#, Dname) <- dept(D#, Dname), ¬large_dept(D#).

This example illustrates some of the benefits of online aggregation. Negated
queries are subject to existential variable optimization; thus, in LDL++ the
search for new employees of a department stops as soon as the threshold of 7 is
reached. But the traditional count must retrieve all employees in the department,
no matter how high their count is.

Several authors have advocated extensions to predicate calculus with
generalized existential quantifiers [13,14], to express a concept such as “There
exist at least seven employees”. This idea is naturally supported by a new
aggregate atleast((K, X)) that returns the value yes as soon as K instances of X
are counted. This aggregate of Boolean behavior can be defined as follows:

single(atleast, (K, Y), 1).
multi(atleast, (K, Y), Old, New) <- Old < K, New = Old + 1.
ereturn(atleast, (K, Y), K1, yes) <- K1 = nil, K = 1.
ereturn(atleast, (K, Y), K1, yes) <- K1 ≠ nil, K1 + 1 = K.

Then, a predicate equivalent to large_dept(D#) can be formulated as follows:

lrg_dpt(D#, atleast((7, Ename))) <- empl(Ename, Sal, D#).

Here, because of the condition Old < K in the multi rule defining atleast, the
search stops after seven employees, even for a positive goal ?lrg_dpt(D#, yes),
for which no existential optimization is performed.
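The early stop can be made visible with a small Python sketch. It is an assumption-laden illustration rather than the LDL++ evaluation strategy: `atleast` follows the four rules above, and `employees` is a hypothetical scan that records how many tuples were actually fetched.

```python
def atleast(k, values):
    """Return "yes" once k input values have been seen, else None."""
    old = None                                   # nil
    for _y in values:
        # ereturn rules: K = 1 on the first value, or Old + 1 = K afterwards.
        if (old is None and k == 1) or (old is not None and old + 1 == k):
            return "yes"                         # stop: no further input is consumed
        # single rule on the first value, multi rule afterwards; the Old < K
        # guard always holds here because we return as soon as K is reached.
        old = 1 if old is None else old + 1
    return None

fetched = []

def employees(n):
    # Hypothetical scan of a department with n employees; records each fetch.
    for i in range(n):
        fetched.append(i)
        yield f"emp{i}"

answer = atleast(7, employees(1000))
```

In this sketch only seven of the thousand employees are fetched before the answer is produced, and a department with fewer than k employees yields no answer at all.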

Observe that mcount and atleast aggregates define monotonic mappings 
with respect to set containment: in the next section, we prove that all UDAs 
defined with only early returns (e.g., online aggregates) are monotonic, and can 
be freely used in recursive queries and rules. 

7 Formal Semantics and Monotonicity 

The logic-based semantics of a program with aggregates can be defined by
viewing it as a short-hand of another Datalog program without aggregates. For
that, we need the ability to enumerate the elements of the set one-by-one. For
instance, if we assumed that the set elements belong to a totally ordered domain,
then we could visit them one-at-a-time in, say, ascending order. But such an
assumption would violate the genericity principle [1]; moreover, it still requires
nonmonotonic constructs to visit the elements one-by-one, thus preventing the
use of aggregates in recursive rules. A better solution consists in using choice
[21,22], or more precisely the dynamic version of choice [12], which can be used
freely in recursive rules. By enforcing functional dependencies on the result
produced by the rules, this powerful construct allows us to derive the Ordering
rules, below, which arrange the elements of the set in a simple chain.

Positive choice programs are equivalent to programs with negated goals; these
programs are guaranteed to have one or more total stable models [11]. As shown
in [12], choice is strictly more powerful than other nondeterministic constructs
previously defined, including the witness operator of Abiteboul and Vianu [1], and
the static version of choice [12]. This added power allows us to express
computations that would not have been possible using witness or static choice. In
particular, a positive choice program can be used to order the elements of a
set into a chain [12]. This operation is critical in our inductive definition of
aggregates discussed next.

For instance, say that we have the following rule where we apply myagr on
the Y-values grouped by X:

r : p(X, myagr(Y)) <- q(X, Y).

The mapping from the body to the head of this rule can be expressed by (i) the
next_r rules that arrange the Y-values of q(X, Y) into a chain, (ii) the cagr
rules that implement the inductive definition of the aggregate by calling the
single and multi rules, and (iii) the yield rules that produce the actual pairs
in p by using the ereturn and freturn rules.

The next_r rules use the choice construct of LDL++.

Ordering Rules:

next_r(X, nil, nil) <- q(X, Y).
next_r(X, Y1, Y2) <- next_r(X, _, Y1), q(X, Y2),
    choice((X, Y1), (Y2)), choice((X, Y2), (Y1)).

Aggregates can be defined by the following internal recursive predicate cagr:

cagr Rules:

cagr(myagr, X, Y, New) <- next_r(X, nil, Y), Y ≠ nil,
    single(myagr, Y, New).
cagr(myagr, X, Y2, New) <- next_r(X, Y1, Y2),
    cagr(myagr, X, Y1, Old),
    multi(myagr, Y2, Old, New).

The cagr rules implement the inductive definition of the UDA by calling on
the single and multi predicates written by the user. Therefore, single is used
once to initialize cagr(myagr, X, Y, New), where Y denotes the first input value
and New is the value of the aggregate on a singleton set. Then, for each new input
value, Y2 and Old (denoting the last partial value of the aggregate) are fed to
the multi predicate, to be processed by the multi rules defined by the user and
returned to the head of the recursive cagr rule.

Here, we have left the bodies of these rules unspecified, since no “special” 
restriction applies to them (except that they cannot use the predicate p being 
defined via the aggregate, nor any predicate mutually recursive with p). 

The predicates ereturn and freturn are called by the yield rules that control
what is to be returned:

Early-Yield Rule:

p(X, AgrVal) <- next_r(X, nil, Y), Y ≠ nil,
    ereturn(myagr, Y, nil, AgrVal).
p(X, AgrVal) <- next_r(X, Y1, Y2),
    cagr(myagr, X, Y1, Old),
    ereturn(myagr, Y2, Old, AgrVal).

The first early-yield rule applies to the first value in the set, and the second
one to all successive values. The result(s) returned when all elements in the set
have been visited is controlled by a final-yield rule:

Final-Yield Rule:

p(X, AgrVal) <- next_r(X, _, Y), ¬next_r(X, Y, _),
    cagr(myagr, X, Y, Old),
    freturn(myagr, Y, Old, AgrVal).

This general template defining the meaning of all aggregates is then
customized by the user-supplied rules for single, multi, ereturn, and freturn,
which all have myagr as their first head argument (thus the aggregate name is
used to avoid interference with other UDAs).
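The template can be read operationally as a fold over the chain of values. The following Python function is a hedged sketch of that reading, not the formal semantics: the iteration order of `values` stands in for the chain chosen by the ordering rules, `None` stands in for nil, the user rules are passed as plain functions returning lists of results, and every multi call is assumed to succeed. The name `evaluate_uda` is hypothetical.

```python
def evaluate_uda(values, single, multi, ereturn=None, freturn=None):
    """Apply a user-defined aggregate to the values of one group."""
    results = []
    old = None                # nil: no partial aggregate value yet
    last = None
    for y in values:
        if ereturn is not None:
            results.extend(ereturn(y, old))              # early-yield rules
        old = single(y) if old is None else multi(y, old)  # cagr rules
        last = y
    if freturn is not None and old is not None:
        results.extend(freturn(last, old))               # final-yield rule
    return results

avg_result = evaluate_uda(
    [10, 20, 30],
    single=lambda y: (y, 1),
    multi=lambda y, st: (st[0] + y, st[1] + 1),
    freturn=lambda y, st: [st[0] / st[1]],
)
mcount_result = evaluate_uda(
    "abc",
    single=lambda y: 1,
    multi=lambda y, old: old + 1,
    ereturn=lambda y, old: [1] if old is None else [old + 1],
)
```

Note that only the final-yield step needs to know that the input is exhausted; an aggregate that supplies no freturn function never depends on that information.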



Monotonicity. Observe that negation is only used in the final-yield rule. When
the aggregate definition contains no final-return rule (i.e., only early return
rules) then the final-yield rule can be eliminated and the remaining rules
constitute a positive choice program. Now, positive choice programs define
monotonic transformations, an important result obtained in [12] that will be
summarized next.

As customary in deductive databases, a program P can be viewed as
consisting of two separate components: an extensional component, denoted edb(P),
and an intensional one, denoted idb(P). Then, a positive choice program defines
a monotonic multi-valued mapping from edb(P) to idb(P), as per the following
theorem proven in [12]:

Theorem 1. Let P and P' be two positive choice programs where idb(P') =
idb(P) and edb(P') ⊇ edb(P). Then, if M is a choice model for P, there
exists a choice model M' for P' such that M' ⊇ M.

Thus, for a multi-valued function we only require that, as the value of the 
argument increases, some of the values of the function also increase (we do not 
require all values to increase). Furthermore, we say that we have a fixpoint 
when one of the function values is equal to its argument. Each multi-valued 
mapping also induces a nondeterministic (single-valued) mapping, defined as an 
arbitrary choice among the values of the function. As shown in [12], for choice 
programs, a fixpoint is reached by the inflationary repeated application of such 
a nondeterministic mapping. 

Furthermore, the set of these fixpoints coincides with the stable models [11]
of the program obtained by rewriting the choice program into an equivalent
program with negation [21]. Thus, our next_r rules are formally defined by their
equivalent rules with negated goals:

next_r(X, Y1, Y2) <- next_r(X, _, Y1), q(X, Y2),
    chosen(X, Y1, Y2).
chosen(X, Y1, Y2) <- next_r(X, Y1, Y2), ¬diffchoice(X, Y1, Y2).
diffchoice(X, Y1, Y2) <- chosen(X, Y1, Y2'), Y2' ≠ Y2.
diffchoice(X, Y1, Y2) <- chosen(X, Y1', Y2), Y1' ≠ Y1.

This program, as every choice program reexpressed via negation, has one or 
more (total) stable models [11], where each stable model satisfies all the FDs 
defined by the choice goals [22,12]. 

Therefore, keeping with previous authors [16], we have defined the semantics 
of aggregates in terms of stable models [11]; however, through the use of the 
choice construct, we have avoided the computational intractability problems of 
stable models. Furthermore in our semantics, choice rules are only used to de- 
liver the next value Y2 generated by our rule r (for the given group-by value 
X and the previous such value Yl): thus, an operational realization is very sim- 
ple and efficient since it reduces to a get-next operation on data. Furthermore, 
since aggregates define monotonic transformations in the usual lattice of set 
containment, bottom-up execution techniques of deductive databases, such as 
the semi-naive fixpoint, and magic sets, remain valid for these programs. Thus,
monotone aggregates can be added to deductive database systems with no change
in execution strategy, a conclusion that also applies to recursive queries with
monotone aggregates in SQL DBMSs.

8 Programs with Monotone Aggregation 

We now express several examples derived from [19] using our new monotonic 
aggregates [28]. 

Join the Party. The SQL-AG query of Example 4 can be expressed in LDL++
using the monotonic aggregate mcount and an additional predicate c_friends.

willcome(P) <- sure(P).
willcome(P) <- c_friends(P, K), K >= 3.
c_friends(P, mcount(F)) <- willcome(F), friend(P, F).

Here, we have set K = 3 as the number of friends required for a person to 
come to the party. Consider now a computation of these rules on the following 
database. 



sure(mark).    friend(jerry, mark).
sure(tom).     friend(penny, mark).
sure(jane).    friend(jerry, jane).
               friend(penny, jane).
               friend(jerry, penny).
               friend(penny, tom).

Then, the basic semi-naive computation yields: 

willcome(mark), willcome(tom), willcome(jane),
c_friends(jerry, 1), c_friends(penny, 1), c_friends(jerry, 2),
c_friends(penny, 2), c_friends(penny, 3), willcome(penny),
c_friends(jerry, 3), willcome(jerry).

This example illustrates how the standard semi-naive computation can be 
applied to queries containing monotone user-defined aggregates. 
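The computation above can be replayed with a short Python sketch. It is an approximation of semi-naive evaluation written for this example only: `delta` holds the willcome facts derived in the previous round, `counted` is a hypothetical helper structure remembering which friends mcount has already seen for each person, and the sorting is there only to make the order of derivations deterministic.

```python
sure = {"mark", "tom", "jane"}
friend = {
    ("jerry", "mark"), ("penny", "mark"), ("jerry", "jane"),
    ("penny", "jane"), ("jerry", "penny"), ("penny", "tom"),
}

def join_the_party(sure, friend, k=3):
    willcome = set(sure)
    c_friends = set()             # (person, running count) pairs from mcount
    counted = {}                  # person -> friends already counted
    delta = set(sure)             # willcome facts derived in the last round
    while delta:
        new = set()
        for f in sorted(delta):
            for (p, g) in sorted(friend):
                if g == f and f not in counted.setdefault(p, set()):
                    counted[p].add(f)
                    n = len(counted[p])
                    c_friends.add((p, n))          # early return of mcount
                    if n >= k and p not in willcome:
                        new.add(p)
        willcome |= new
        delta = new
    return willcome, c_friends

willcome, c_friends = join_the_party(sure, friend)
```

Each round only looks at newly derived willcome facts, and the counts only ever grow, which is why no derived fact ever has to be retracted.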

The Join-the-Party query of Example 4 eliminates the need for a c_friends
predicate by using the ‘having’ construct. In LDL++, we can obtain the same
effect by using the aggregate atleast defined in Section 6, which is also
monotone:

wllcm(F, yes) <- sure(F).
wllcm(X, atleast((3, F))) <- wllcm(F, yes), friend(X, F).

Unlike in the previous formulation, where a new c_friends tuple is produced
every time a new friend is found, a new wllcm tuple is here produced only when
the threshold of 3 is crossed.

Company Control. Another interesting example is transitive ownership and
control of corporations. Say that owns(C1, C2, Per) denotes the percentage of
shares that corporation C1 owns of corporation C2. Then, C1 controls C2 if it
owns more than, say, 50% of its shares. In general, to decide whether C1
controls C3 we must also add the shares owned by corporations such as C2 that
are controlled by C1. This yields the transitive control predicate defined as
follows:

control(C, C) <- owns(C, _, _).
control(Onr, C) <- towns(Onr, C, Per), Per > 50.
towns(Onr, C2, msum(Per)) <- control(Onr, C1), owns(C1, C2, Per).

Thus, every company controls itself, and a company C1 that has transitive
ownership of more than 50% of C2’s shares controls C2. In the last rule, towns
computes transitive ownership with the help of msum that adds up the shares of
controlling companies. Observe that any pair (Onr, C2) is added at most once to
control, thus the contribution of C1 to Onr’s transitive ownership of C2 is only
accounted for once.
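To see the three rules at work, here is a small Python sketch. It deliberately uses a naive fixpoint that recomputes the ownership totals in every round instead of emitting the partial sums of msum; since the totals only grow and the test is Per > 50, the resulting control relation is assumed to be the same. The function name and the sample `owns` data are hypothetical.

```python
def company_control(owns):
    """owns: dict mapping (owner, company) -> percentage of shares owned."""
    control = {(c, c) for (c, _c2) in owns}     # control(C, C) <- owns(C, _, _)
    changed = True
    while changed:
        changed = False
        towns = {}
        # towns(Onr, C2, msum(Per)) <- control(Onr, C1), owns(C1, C2, Per)
        for (onr, c1) in control:
            for (o, c2), per in owns.items():
                if o == c1:
                    towns[(onr, c2)] = towns.get((onr, c2), 0) + per
        # control(Onr, C) <- towns(Onr, C, Per), Per > 50
        for (onr, c), per in towns.items():
            if per > 50 and (onr, c) not in control:
                control.add((onr, c))
                changed = True
    return control

owns = {("a", "b"): 60, ("a", "c"): 30, ("b", "c"): 25}
control = company_control(owns)
```

In the sample data, a owns 60% of b directly, so a controls b; a's own 30% of c plus the 25% held by the controlled company b then add up to 55%, so a also controls c.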

9 Conclusion 

The practical importance of database aggregates has long been recognized, but
in-depth treatments of this critical subject were lacking. In this paper, we have
addressed both the theoretical and practical aspects of aggregates, including
user-defined aggregates and online aggregation. Our logic-based formalization
of aggregates provided a simple and practical solution to the problem of monotone
aggregation, a problem on which many previous approaches had achieved only
limited success [17,15,10,19].

Various examples were also given illustrating the power and flexibility of UDAs
in advanced applications; several more examples, omitted because of space
limitations, can be found in [25]. For instance, by adding greedy aggregates built
upon priority queues, we expressed graph algorithms such as Dijkstra’s single
source least-cost path, or Prim’s least-cost spanning tree. Also data mining
functions, including tree classifiers and Apriori, can be formulated efficiently
using our UDAs.

At UCLA, we developed the SQL-AG prototype that supports the UDAs
here described on top of DB2 [25], and we also developed a new version of
LDL++ [28] supporting the Datalog extensions described in this paper. The
SQL-AG implementation is of particular significance, since it shows that UDAs
are fully compatible with O-R systems, and can actually outperform built-in
aggregates in particular applications.

We are currently investigating the issue of ease of use in UDAs. In fact, while
UDAs in LDL++ can be expressed using rules, several procedural language
functions must be written to add a new UDA in SQL-AG or SQL3. However,
our experience suggests that in most UDAs the computations to be performed
by the INITIALIZE, ITERATE, TERMINATE, and PRODUCE functions are very
simple, and can effectively be expressed using an (SQL-like) high-level language.
We expect that this approach will enhance users’ convenience and portability.

A simple SQL-like language for UDAs is described in [27].

Abstract. A string database is simply a collection of tables, the columns
of which contain strings over some given alphabet. We address in this paper
the issue of designing a simple, user-friendly query language for string
databases. We focus on the language FO(•), which is classical first order
logic extended with a concatenation operator, and where quantifiers
range over the set of all strings. We wish to capture all string queries, i.e.,
well-typed and computable mappings involving a notion of string genericity.
Unfortunately, unrestricted quantification may allow some queries
to have infinite output. This leads us to study the “safety” problem for
FO(•), that is, how to build syntactic and/or semantic restrictions so
as to obtain a language expressing only queries with finite output, hopefully
all string queries. We introduce a family of such restrictions and
study their expressiveness and complexity. We prove that none of these
languages express all string queries. We prove that a family of these languages
is equivalent to a simple, tractable language that we call StriQueL,
standing for String Query Language, which thus emerges as a robust and
natural language suitable for string querying.



1 Introduction 

Current database management systems, especially those based on the relational
model, have little support for string querying and manipulation. This can be a
problem in several string-oriented application areas such as molecular biology
(see e.g. [6,31]) and text processing, the latter becoming crucial with the burst
of the Web, XML and digital libraries, among others. In such a system, a string
is one of the basic data types of Codd’s relational model [5], which means that
the strings are treated as atomic entities; thus a string can be accessed only
as a whole and not on the level of the individual characters occurring within
it. Modern object-oriented systems are usually alike in this sense. Although
they offer support for complex objects, strings are usually treated as atomic. In
SQL, the only non-atomic operator is the LIKE operator, which can be used in
simple pattern matching tasks such as finding a substring in a field; however,
the expressive power of the operator is limited. In this paper, we address the
issue of designing a general purpose query language for string databases, based
on first order logic.

Lately we have witnessed in the database community an increased research 
activity into databases having strings, tuples and sets as the principal datatypes. 
The area deserves to be regarded as a subfield of database theory. It is called 
“string databases” or “sequence databases,” depending on the authors. Here we 
shall use the former name. 

We extend the relational model to include finite strings over some given
finite alphabet Σ as primary objects of information. A relation of arity k in our
model is then a finite subset of the k-fold Cartesian product of Σ*, the set of
all finite strings over Σ, with itself. In other words, each position in a tuple of
a relation contains a string of arbitrary length instead of just a single atomic
value. This definition was essentially introduced in [14] and later used in e.g.
[17] and [24]. However, a brief excursion into the history books reveals for
instance that Stockmeyer [34] was familiar with string relations. Earlier still,
Quine [26] showed that first order logic over strings is undecidable.

From the point of view of design, in addition to data extraction features, such 
as “retrieve all palindromes,” the string language needs also data restructuring 
constructs [14,17,24,35]. For example, given two unary relations, one might want 
to concatenate each string from one relation with a string from the other relation, 
as opposed to merely taking the Cartesian product of the two relations. The 
former returns a set of “new” strings, whereas the latter returns a set of pairs 
of strings previously existing in the input instance. 
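The difference between the two operations is easy to see on a toy instance. The Python lines below are only an illustration; the relations `r` and `s` are hypothetical sample data.

```python
from itertools import product

r = {"ab", "c"}
s = {"x", "yz"}

# Cartesian product: a binary relation whose components already exist in the input.
pairs = set(product(r, s))

# Restructuring: a unary relation of newly built strings.
concat = {u + v for u, v in product(r, s)}
```

Every component of `pairs` already occurs in `r` or `s`, whereas none of the strings in `concat` does.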

Adequate query languages for string databases have been the aim of several
attempts, mainly [14,17,24] (see Section 6 for works less close to ours). These
three provide full power languages, based on sophisticated primitives, namely
transducers, datalog with or without negation, and an original logic equivalent
to multi-tape automata. However, in addition to finding complete languages,
there is a need to define, and understand in depth, languages à la SQL,
user-friendly, and of low time complexity, without recursion and within logspace.
The “user-friendly” aspect was part of the motivation of the theory of range
restriction, inspiring numerous works (see e.g. [1], Chap. 5.3 and 5.4).

In this paper we return to the spirit of “the founding fathers” ([34,26]). In
other words, we will use as string-language relational calculus with an interpreted
concatenation function. Our syntax will be called FO(•), where the symbol •
stands for concatenation. We will define various semantics, one of them yielding
our main language. Since according to a central authority in the field [37], SQL
(i.e. FO) is “intergalactic dataspeak,” we take a parsimonious stand to stick to
FO as close as possible. In spite of its rather “formal” look, we believe the
simplicity and low complexity of our main language to be well-suited for “everyday”
database querying and programming. As a consequence we called it StriQueL,
standing for String Query Language.

The contributions of this paper are the following. String queries are formally
defined, with emphasis on string genericity. Given the syntax FO(•), we then
consider first evaluating query formulas by making quantifiers range over the
whole domain Σ*. This yields the query language ⟦FO(•)⟧_nat. The question is
then: What are the relationships between ⟦FO(•)⟧_nat and string queries? Does
the former capture all, or only some of the latter?

One can pursue a “top-down” or a “bottom-up” approach. The “top-down”
approach would consist in taking the whole language ⟦FO(•)⟧_nat, and restricting
it syntactically without sacrificing expressive power. In this direction, we prove
that the problem whether a formula expresses a string query is (not surprisingly)
undecidable. We then undertake a “bottom-up” approach: designing a very
simple language, obviously capturing only string queries, then empowering it little
by little.

We first study three versions of a restricted semantics, where the quantifiers
are not allowed to range over the full domain Σ*, and we compare them to each
other. In a parallel approach, we define syntactic restrictions yielding a language
that, we believe, corresponds naturally in the string context to the intuitive
relational “SQL level.” That is, its comfort and expressivity are the ones we wish
for “everyday” querying of the database by end-users. We call it FO_rr(•) (for
range-restricted).

It turns out that FO_rr(•) is equivalent to one of our semantic restrictions.
This result emphasises the robustness of the language. All these properties made
us call it StriQueL (String Query Language).

Now, it is not surprising that we show that our StriQueL language does 
not express all string queries. More precisely, its complexity is shown to be of 
the same order of magnitude as its homologue the pure relational SQL, namely 
logarithmic space. 

Although considered a quality for “everyday” use of string databases, these
limitations in expressive power are not desirable for more advanced applications.
We consider then how to overcome the limitations. We define a safe,
non-range-restricted fragment of FO(•), through an operator schema F. It is based
on an interesting extension, namely introducing an additional symbol in Σ along
with particular constraints.

All languages introduced (except ⟦FO(•)⟧_nat) are shown to compute only
string queries, and to be actually evaluable (their semantics is constructive).

Although in molecular biology attempts of string manipulation have been 
proposed based on grammatical constructs, we believe first order logic over 
strings should provide an original and very flexible manipulation tool for this 
area. Moreover, it includes naturally some forms of pattern-matching capabil- 
ities. This issue is briefly discussed. Finally, the expressive power of the pure 
relational fragment of our languages is compared with pure relational FO. 

The remainder of the paper is organised as follows. String queries and
genericity, ⟦FO(•)⟧_nat and the discussion on the decidability of being a string
query form Section 2. Section 3 introduces restrictions yielding our String Query
Language, and contains a study of it. The F operator and the study of the
corresponding language are presented in Section 4. In Section 5, we sketch the
expressive power of our languages in terms of other formalisms, namely the pure
relational and formal languages. Related works are detailed in Section 6, before
concluding and presenting perspectives.



2 Definitions and Problems 

In this section, we set up the basic definitions for our work. First, string queries
are defined, emphasising string genericity. The syntax FO(•) and the semantics
⟦FO(•)⟧_nat are given. We then briefly discuss the difficulty of deciding whether
a formula represents a string query. This difficulty is a strong motivation for our
“bottom-up” approach to restricting the syntax or semantics for FO(•) (i.e., we
shall begin with very simple syntax/semantics that we enrich little by little).
Throughout this paper we assume that a fixed finite alphabet Σ is given.



2.1 Queries and String Genericity 

In our study of query languages for string databases, we first need to fix a
definition of query. Although several definitions are possible, we choose in this
paper one inspired from the traditional one in the relational model (see e.g. [1]
for a definition and discussion of relational genericity). We adapt the traditional
definition to string databases. Well-typedness and computability are essentially
the same as in the relational case. String genericity is new. The idea of string
genericity is to identify mappings that differ only in the renaming of string
symbols, i.e., letters of the fixed alphabet Σ.

A relation r of arity k in our model is a finite subset of the k-fold Cartesian
product of Σ*, the set of all finite strings over Σ, with itself. The arity of r,
denoted α(r), is defined to be k. A string database instance I is a finite sequence
of relations (r_1, ..., r_n). The schema of I is (α(r_1), ..., α(r_n)).

We consider mappings from string databases to string relations. We’ll sometimes
also call them string mappings to emphasise the context. Well-typedness of
a mapping simply says that the input and output schemas are fixed. Let us recall
also computability [1], which immediately applies to string mappings. A mapping h
is computable if there exists a Turing machine M, such that for each instance I,
given I encoded on the tape, M does the following. If h(I) is defined, M computes
the encoding of h(I), writes it on the tape and stops. Otherwise M doesn’t
stop.

We now turn to the specificity of string mappings. First, recall that a mapping
is relational generic if it is invariant under permutations of the domain (the
domain being Σ*). We will require in addition that permutations preserve the
structure of strings. To this end we say that a permutation p of Σ* is a string
morphism, if for all u and v in Σ*, p(u.v) = p(u).p(v). In other words, symbols
of Σ are “repainted” without changing the way they combine within strings.
Now, a mapping h from string databases to string relations is string generic if
for each permutation p of Σ* which also is a string morphism, we have that for
all instances I,

    p(h(I)) = h(p(I)).

The concept of C-genericity for any finite subset C of Σ* is defined as usual.
C-genericity will be used to allow constants in query expressions.¹

As stated below, a relational generic mapping is also string generic, but
some string generic mappings are not relational generic. This phenomenon is due
to the fact that string isomorphisms of Σ* are simply particular permutations of
Σ*. For instance, the mapping R(u) ↦ R(u.u) is string generic but not relational
generic.

Fact 1 Every relational generic mapping is string generic, but the converse does 
not hold. 
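The doubling mapping R(u) ↦ R(u.u) can be checked on a small instance. The Python sketch below is an illustration under stated assumptions: `extend` builds a string morphism from a permutation of letters, while `swap_strings` is a hypothetical permutation of strings that exchanges "a" and "aa" and is therefore not a morphism.

```python
def extend(letter_perm):
    """Extend a permutation of the alphabet letter by letter to all strings;
    the result is a string morphism by construction."""
    return lambda w: "".join(letter_perm[ch] for ch in w)

def apply_to_relation(p, rel):
    return {p(u) for u in rel}

def h(rel):
    """The mapping R(u) |-> R(u.u)."""
    return {u + u for u in rel}

swap_ab = extend({"a": "b", "b": "a"})

def swap_strings(w):
    # Exchanges the strings "a" and "aa" and fixes every other string.
    return {"a": "aa", "aa": "a"}.get(w, w)

R = {"a", "ab"}
commutes_with_morphism = apply_to_relation(swap_ab, h(R)) == h(apply_to_relation(swap_ab, R))
commutes_with_arbitrary = apply_to_relation(swap_strings, h(R)) == h(apply_to_relation(swap_strings, R))
```

On this instance the mapping commutes with the letter-renaming morphism but not with the arbitrary permutation of strings, which is consistent with Fact 1 (a single instance illustrates, but does not prove, string genericity).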

We now formally define string queries. 

Definition 1 Let (a_1, ..., a_k) be a database schema and b an arity. A string
query of type (a_1, ..., a_k) → b is a partial string generic and computable
mapping from the set of all database instances over (a_1, ..., a_k) to the set of
all relations of arity b.

2.2 A Family of Query Languages for the FO(•) Syntax

In this section, we assume classical notions and vocabulary (see e.g. the definition
of FO in [1]) and focus on our new string primitive. The syntax FO(•) and
a parameterised semantics ⟦FO(•)⟧_d are presented. A difference with classical
first order logic is that valuations and quantifiers only range over the domain d
considered, where d ⊆ Σ*.

Syntax. We introduce here the syntax FO(•). It is an extension of first order
predicate logic (FO) with a natural concatenation operator for strings, the dot
operator, denoted •. Terms are variables or constants from Σ*, or of the form
t_1 • t_2, where t_1, t_2 are themselves terms. Formulas are built as usual from
relation symbols and equality between terms, to give atoms; and inductively with
the operators ∧, ∨, ¬, ∃, ∀.

Semantics. Let be given terms t_1, t_2, an instance I and a valuation v for the
variables appearing in t_1, t_2, i.e., a mapping from these variables to Σ*. The
interpretation of t_1 • t_2 under I and v, denoted I(t_1 • t_2), is I(t_1).I(t_2),
where, in this expression, "." denotes the usual semantic concatenation of strings
(associative, with neutral element ε); and as usual I(x) = v(x) for a variable x,
and I(u) = u for u ∈ Σ*. Satisfaction of a formula φ of FO(•) is then inductively
defined. Atoms and connectives ∧, ∨, ¬ are as usual. The semantics of quantifiers
∃, ∀ is obtained by making variables range over some given domain d ⊆ Σ* (see e.g.
the definition of relativized interpretations in [1]).

¹ A way to define string genericity that might be considered more natural could be
to define permutations p over Σ (instead of Σ*), and then to extend p to strings
in Σ* and database instances in the straightforward way; this would avoid defining
string morphisms.

Definition 2 Let φ(x̄) be an FO(•) formula with x̄ being a vector of its free
variables. Given d ⊆ Σ*, the mapping expressed by φ under the semantics d is
defined as

    ⟦φ⟧_d(I) = { v(x̄) : v is a valuation of x̄ making φ(x̄) true in I },

where both v and the quantifiers of φ range over d.

The set of all mappings expressed by query expressions in FO(•) under the
semantics d is denoted ⟦FO(•)⟧_d.

At this point, we have in hand a rich and simple definition of a family of
languages ⟦FO(•)⟧_d based on the syntax FO(•) and parameterised by the
domain d. Our search for admissible semantic restrictions will go along the line of
carefully defining more and more powerful domains d.
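The role of the parameter d can be illustrated by evaluating one fixed formula over two finite domains. The Python sketch below hard-codes the formula φ(x) = ∃y [R(y) ∧ x = y • y], chosen here only as an example; the function name and the two sample domains are hypothetical.

```python
def double_query(R, d):
    """Evaluate phi(x) = exists y [R(y) and x = y . y], where both the
    valuation of x and the quantified variable y range over the domain d."""
    return {x for x in d if any(y in R and x == y + y for y in d)}

R = {"ab"}
d_small = {"ab"}                 # only the string stored in the instance
d_large = {"ab", "abab"}         # additionally contains the doubled string

out_small = double_query(R, d_small)
out_large = double_query(R, d_large)
```

The same formula on the same instance returns nothing under the smaller domain and the doubled string under the larger one, so the choice of d is part of the meaning of the query.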

2.3 Where and How to Look for the Right Language Issues and 
Directions 

In this section, we present our approach to a search for admissible query
languages for string databases.

Choosing a query language among ours now amounts to fixing the ranging
domain d. The natural choice for d would be to take the whole Σ*. This is called
the natural semantics, and defines a query language ⟦FO(•)⟧_nat that we will
consider now.

We want to write formulas φ in FO(•) such that the corresponding mapping
is a string query. The following fact is straightforward.

Fact 2 There is an FO(•) formula φ such that ⟦φ⟧_nat is not a string query.

We are thus faced with the task of determining those formulas that express
string queries (under the natural semantics). However, as stated below, a direct
approach is doomed to fail.

Fact 3 Given a formula φ in FO(•), it is undecidable whether ⟦φ⟧_nat is a string
query.

Our purpose then becomes finding a syntactic fragment of FO(•) that
captures all and only string queries under natural semantics. Our language StriQueL
is a first step in this direction (see Section 3.3), but, due to its pragmatic
low-complexity nature, it obviously does not express all string queries. The next
step, beyond the scope of this paper, would be to design such a sound and
complete language, or show that the class of string queries does not have an
effective syntax.

As a starting point, one might look to the results of Makanin [8,23], from
which it follows that the satisfiability problem for conjunctive FO(•) queries
under natural semantics is decidable. However it is not clear if Makanin’s
techniques can be extended to show that a conjunctive FO(•) formula expresses a
string query.

3 Restricting the Language

3.1 Semantic Restrictions 

In this section, several progressively richer semantics, that is, semantics that
allow expressing more and more string queries, are introduced. The expressivity
of the query languages they generate is compared.

In the traditional relational setting, there are known ways to restrict the FO- 
language to only generate relational queries. The queries can then be evaluated 
using the so called active domain semantics (see e.g. [1]). This is due to the 
genericity property which enforces the output of queries to be within the active 
domain. In our setting string genericity allows to construct strings that are 
outside the active domain and an FO{u) formula can generate a string query 
even if the output is not included in the active domain. For instance take (f{x) = 
3y, z[{x = y • z) A R{y) A R{z)], where the output will consist of concatenations 
of strings from the input instance. We therefore have to develop new techniques 
adapted to the string setting. 

Before giving the formal definitions below, let us first give a flavour of the
issue. The “bottom possibility,” i.e., sticking to the relational concepts, consists
in making the quantifiers range over the active domain, adom (the strings in
the input instances, but not their substrings). This use of the classical database
notion of the active domain in the context of string queries appears in [15].
Unfortunately, extracting the square root of some strings in the input instance,
as in e.g. $\varphi(x) = R(x \bullet x)$, is possible only if the string x is itself
in the input instance. Taking the extended active domain, eadom (i.e., considering
also substrings of strings in the input instance), does the trick. The domain eadom
was introduced in [24]. However, this domain does not allow building new strings
using for instance the query
$\varphi(x) = \exists y\, [R(y) \wedge x = y \bullet y]$. If, in addition to
substrings, we allow k concatenations of strings in the input instance, we can
handle the previous query. The corresponding domain is called $eadom^k$. This
domain was not considered in previous works; however, it can be seen as a
combination of eadom and of the $adom^k$ used in [15].

The next step would be not to bound the construction of new strings by some
constant k. This would make the domain, and thus potentially the output,
infinite. Consider for example the query
$\varphi(x) = \exists y, z\, [R(y) \wedge x = y \bullet z]$. Hence there
is a need for a bound of some kind. But this bound could depend on the instance,
as opposed to the query. Such a language is considered in Section 4.

In spite of its restrictions, $eadom^k$, used in an adequate manner, yields an
appealing and useful language, as we shall see below.

We now proceed to the formal definitions. Given an instance I and a formula
$\varphi$ in $\mathrm{FO}(\bullet)$, the active domain of $\varphi$ and I, denoted
$adom(\varphi, I)$, is the set of all strings occurring in $\varphi$ or in the
columns of the relations in I. The extended active domain of $\varphi$ and I,
denoted $eadom(\varphi, I)$, is $adom(\varphi, I)$ closed under substrings.
For each strictly positive integer k, the set of strings obtained by concatenating
at most k strings from $eadom(\varphi, I)$ gives us $eadom(\varphi, I)^k$.
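The three domains can be illustrated by a small sketch. It is only an illustration under stated assumptions: an instance is modelled as a Python dictionary from relation names to sets of tuples of strings, and the names `adom`, `eadom`, `eadom_k`, `square_root`, `doubling` and the sample instance `I` are hypothetical, not taken from the paper. The two sample queries are the square-root and doubling queries discussed above, evaluated by brute force over a chosen ranging domain.

```python
from itertools import product


def adom(instance, constants=()):
    """Active domain: strings in the columns of the relations, plus formula constants."""
    strings = set(constants)
    for relation in instance.values():
        for row in relation:
            strings.update(row)
    return strings


def eadom(instance, constants=()):
    """Extended active domain: adom closed under substrings."""
    return {s[i:j]
            for s in adom(instance, constants)
            for i in range(len(s) + 1)
            for j in range(i, len(s) + 1)}


def eadom_k(instance, k, constants=()):
    """Concatenations of at most k strings taken from eadom."""
    base = eadom(instance, constants)
    return {"".join(parts)
            for n in range(1, k + 1)
            for parts in product(base, repeat=n)}


# Hypothetical instance with a single unary relation R.
I = {"R": {("abab",), ("c",)}}
R = {row[0] for row in I["R"]}


def square_root(domain):
    """phi(x) = R(x . x), with x ranging over `domain`."""
    return {x for x in domain if x + x in R}


def doubling(domain):
    """phi(x) = exists y [R(y) and x = y . y], with x and y ranging over `domain`."""
    return {x for x in domain if any(y in R and x == y + y for y in domain)}


print(square_root(adom(I)))     # empty: "ab" is not in the active domain
print(square_root(eadom(I)))    # contains "ab", a substring of "abab"
print(doubling(eadom(I)))       # empty: "abababab" and "cc" are not substrings
print(doubling(eadom_k(I, 2)))  # contains "abababab" and "cc"
```

The brute-force enumeration is exponential in k and is meant only to make the definitions concrete, not to suggest an evaluation strategy.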

We now consider the three above possibilities for the ranging domain d.
We thus get three different string query languages, namely
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{adom}$,
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$, and
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$. This gives us the
following pleasant state of affairs.

Proposition 1 For any $\mathrm{FO}(\bullet)$ formula $\varphi$, and any
$k \geq 2$, $\llbracket \varphi \rrbracket_{adom}$,
$\llbracket \varphi \rrbracket_{eadom}$, and
$\llbracket \varphi \rrbracket_{eadom^k}$ are string queries.²

The issue in the next result (and again in Section 5) is whether enriching the
semantics, and then restricting to mappings whose input/output is in the active
domain, yields new string queries. In the following case it indeed does give more
power.

Proposition 2 Let $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom} \in adom$
be the subset of those mappings in
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$ where the output contains
only strings in adom. Then $\llbracket \mathrm{FO}(\bullet) \rrbracket_{adom}$
is a proper subset of
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom} \in adom$.

Crux. The proof uses the following lemma, which is false for string queries in
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$. The lemma is proved using
techniques from [3].

Lemma 1. On instances with only one element, Boolean
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{adom}$ queries are constant.

The next result says that allowing concatenations in the domain strictly
increases the expressive power of the corresponding query language.

Proposition 3 $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$ is a proper
subset of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^{k+1}}$, for each
$k \geq 1$.

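Now comparing $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$ to
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k} \in eadom$, the latter
being $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$ restricted to
mappings whose output contains only strings from eadom, it turns out that
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k} \in eadom$ is a succinct
version of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$.

Proposition 4 For each $\mathrm{FO}(\bullet)$ formula $\varphi$ with m free
variables $\{x_1, \ldots, x_m\}$ there exists an $\mathrm{FO}(\bullet)$ formula
$\psi$ with $k \times m$ free variables such that for all instances I, we have
$\llbracket \varphi \rrbracket_{eadom^k}(I) = \llbracket \psi \rrbracket_{eadom}(I)$.

As a consequence, $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$ and
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k} \in eadom$ have the same
expressive power.

Proposition 5 $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom} =
\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k} \in eadom$.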



² Note that $\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^1} =
\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$.

3.2 Syntactic Restrictions

We now define range-restricted formulas by restricting the syntax of the
language, and evaluating queries under the natural semantics. Intuitively, the
idea of range-restriction here is to carefully track, throughout a formula,
whether each variable will take its values among substrings of the instance
(eadom), or a finite number of concatenations of them.

We use as a basis the method given in [1], Algorithm 5.4.3. The relational
algorithm is extended by adding the following possibilities for $\varphi$.

Definition 3 Let $\varphi$ be an $\mathrm{FO}(\bullet)$ formula. Then
$rr(\varphi)$, the set of range-restricted variables of $\varphi$, is defined as
follows:

If $\varphi$ is of the form $R(t_1, \ldots, t_n)$, then $rr(\varphi)$ = the set of
all variables appearing in $t_1, \ldots, t_n$.

If $\varphi$ is of the form $x = u$ or $u = x$, where $u \in \Sigma^*$, then
$rr(\varphi) = \{x\}$.

If $\varphi$ is of the form
$u = v_1 \bullet x_1 \bullet \cdots \bullet v_n \bullet x_n \bullet v_{n+1}$,
where u and the $v_i$'s are in $\Sigma^*$ (possibly $\varepsilon$), then
$rr(\varphi) = \{x_1, \ldots, x_n\}$.

If $\varphi$ is of the form
$\varphi_1 \wedge x = v_1 \bullet x_1 \bullet \cdots \bullet v_n \bullet x_n \bullet v_{n+1}$,
then $rr(\varphi)$ =
   $rr(\varphi_1) \cup \{x_1, \ldots, x_n\}$, if $x \in rr(\varphi_1)$;
   $rr(\varphi_1) \cup \{x\}$, if all $x_i$'s are in $rr(\varphi_1)$;
   $rr(\varphi_1)$, otherwise.

Negation and existential quantification are as in [1].

A formula $\varphi$ is said to be range-restricted if $rr(\varphi)$ equals the
set of free variables in $\varphi$. The set of all range-restricted
$\mathrm{FO}(\bullet)$ formulas is denoted $\mathrm{FO}_{rr}(\bullet)$.
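The cases of Definition 3 can be sketched as a small recursive function. This is a minimal sketch, not the authors' algorithm: the classes `Var`, `Rel`, `Eq` and `And` are a hypothetical encoding of formulas, a term is encoded as a tuple of variables and constant strings read as their concatenation, negation and quantifiers are left out, and the rule used for a general conjunction (union of both sides) is an assumption borrowed from the usual relational treatment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Rel:            # R(t_1, ..., t_n); each t_i is a term (a tuple)
    name: str
    terms: tuple


@dataclass(frozen=True)
class Eq:             # lhs = rhs; both sides are terms
    lhs: tuple
    rhs: tuple


@dataclass(frozen=True)
class And:            # left AND right
    left: object
    right: object


def variables(term):
    """Names of the variables occurring in a term."""
    return {t.name for t in term if isinstance(t, Var)}


def is_var(term):
    return len(term) == 1 and isinstance(term[0], Var)


def rr(phi):
    """Range-restricted variables of phi, for the cases listed in Definition 3."""
    if isinstance(phi, Rel):
        return set().union(*(variables(t) for t in phi.terms))
    if isinstance(phi, Eq):
        if not variables(phi.lhs):                      # u = v_1.x_1. ... (also u = x)
            return variables(phi.rhs)
        if is_var(phi.lhs) and not variables(phi.rhs):  # x = u
            return variables(phi.lhs)
        return set()                                    # assumption: nothing is bounded
    if isinstance(phi, And):
        bound = rr(phi.left)
        eq = phi.right
        if isinstance(eq, Eq) and is_var(eq.lhs):       # phi_1 AND x = v_1.x_1. ...
            x, xs = eq.lhs[0].name, variables(eq.rhs)
            if x in bound:
                return bound | xs
            if xs <= bound:
                return bound | {x}
            return bound
        return bound | rr(eq)                           # assumption: plain conjunction
    raise NotImplementedError("negation and quantifiers are not covered by this sketch")


y_in_R = Rel("R", ((Var("y"),),))
doubling = And(y_in_R, Eq((Var("x"),), (Var("y"), Var("y"))))   # R(y) and x = y.y
unbounded = And(y_in_R, Eq((Var("x"),), (Var("y"), Var("z"))))  # R(y) and x = y.z
print(rr(doubling))    # x and y are both range-restricted
print(rr(unbounded))   # only y is: z, and hence x, is not bounded
```

In the second example the variables x and z stay outside the computed set, which matches the earlier observation that this query may have an infinite output.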

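Proposition 6 For any $\mathrm{FO}_{rr}(\bullet)$ formula $\varphi$, the mapping
$\llbracket \varphi \rrbracket_{nat}$ is a string query.

Crux. The idea is given in the proof of Theorem 7 below.

3.3 StriQueL, the String Query Language: Expressivity and
Complexity

Theorem 7 $\mathrm{FO}_{rr}(\bullet) = \bigcup_{k \geq 1} \llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$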

Crux. Given a formula $\varphi$ in $\mathrm{FO}_{rr}(\bullet)$ we compute k with
the same algorithm as $rr(\varphi)$ in the following manner.

If $\varphi$ is of the form $R(x)$ then $k_x = 1$.

If $\varphi$ is of the form $x = u$ ($u \in \Sigma^*$) then $k_x = 1$.

If $\varphi$ is of the form $x = y \bullet u$ then $k_x = k_y + 1$.

If $\varphi$ is of the form $x = y_1 \bullet y_2$ then
$k_x = 2 \times \max(k_{y_1}, k_{y_2})$.
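The four rules above can be replayed mechanically. The following is a minimal sketch, assuming that the atoms of a formula are given as a list of tuples in an order where every variable on a right-hand side has already received a bound; the tuple encoding and the name `concat_bounds` are hypothetical.

```python
def concat_bounds(atoms):
    """Compute the bound k_x for each variable, following the four rules above.

    Each atom is one of (hypothetical encoding):
      ("rel", x)              for R(x)
      ("const", x)            for x = u, u a constant string
      ("append", x, y)        for x = y . u, u a constant string
      ("concat", x, y1, y2)   for x = y1 . y2
    """
    k = {}
    for atom in atoms:
        kind = atom[0]
        if kind in ("rel", "const"):
            k[atom[1]] = 1
        elif kind == "append":
            _, x, y = atom
            k[x] = k[y] + 1
        elif kind == "concat":
            _, x, y1, y2 = atom
            k[x] = 2 * max(k[y1], k[y2])
        else:
            raise ValueError(f"unknown atom kind: {kind}")
    return k


# R(y) and z = y . "ab" and x = z . y
bounds = concat_bounds([("rel", "y"), ("append", "z", "y"), ("concat", "x", "z", "y")])
print(bounds)   # {'y': 1, 'z': 2, 'x': 4}
```

The largest value in the result is then a candidate k for evaluating the formula over $eadom^k$.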

On the basis of the robustness emphasised by this equivalence, and of its
convenience, we take $\mathrm{FO}_{rr}(\bullet)$ as our string query language:
StriQueL.

Evidently the StriQueL language does not express all string queries, just as
SQL does not express all relational queries.

Theorem 8 There is a string query outside
$\llbracket \mathrm{FO}_{rr}(\bullet) \rrbracket_{nat}$.

Crux. The idea is to find a $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$
formula $\varphi$ that computes the concatenation of all the strings in some
input instance I. If this is the case, $\varphi$ cannot be range-restricted,
because from Proposition 4 and Theorem 7 we know that the outputs of
range-restricted formulas are at most k concatenations of strings in I,
for a given k independent of I.

We now slightly adapt the usual relational definition of data complexity to
strings. Given an instance I, its size |I| is the sum of the lengths
of all the strings that occur in I. Complexity measures are now given in terms
of |I|.

Theorem 9 Let $\varphi$ be an $\mathrm{FO}(\bullet)$ formula and I an instance.
Then $\llbracket \varphi \rrbracket_{adom}(I)$,
$\llbracket \varphi \rrbracket_{eadom}(I)$, and
$\llbracket \varphi \rrbracket_{eadom^k}(I)$ are computable in logarithmic space.

Crux. The idea of the proof is more intricate than in the relational case, although
it is based on the same principle. For $\llbracket \varphi \rrbracket_{adom}(I)$
the proof goes as in the usual setting. For
$\llbracket \varphi \rrbracket_{eadom}(I)$ the crux is that a string in eadom can
be described using two pointers (one for the beginning, one for the end) on the
input instance; it is therefore easy to adapt the proof for
$\llbracket \varphi \rrbracket_{adom}(I)$ to this case. The same holds for
$\llbracket \varphi \rrbracket_{eadom^k}(I)$, with 2k pointers.
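The pointer encoding mentioned in the crux can be made concrete as follows. This is only an illustration of the encoding, under assumptions of our own: the instance is laid out on a single input "tape" with `|` as a separator, a string of eadom is a pair `(start, end)` of positions on that tape, and a string of $eadom^k$ is a list of at most k such pairs. The function names are hypothetical, and Python itself of course gives no logarithmic-space guarantee.

```python
from itertools import zip_longest


def chars(tape, pointers):
    """Yield, one by one, the symbols of the string denoted by (start, end) pairs."""
    for start, end in pointers:
        for pos in range(start, end):
            yield tape[pos]


def same_string(tape, p, q):
    """Compare two pointer-encoded strings symbol by symbol, without building them."""
    missing = object()   # marks the end of the shorter string
    return all(a == b
               for a, b in zip_longest(chars(tape, p), chars(tape, q), fillvalue=missing))


tape = "abab|c"                    # instance {abab, c}; '|' is an assumed separator
ab_twice = [(0, 2), (2, 4)]        # "ab" . "ab", an element of eadom^2
whole = [(0, 4)]                   # "abab", an element of eadom
print("".join(chars(tape, ab_twice)))        # abab
print(same_string(tape, ab_twice, whole))    # True
```

Only the pointers and a constant number of counters are kept while comparing, which is the intuition behind the 2k-pointer argument.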



4 Beyond Constant Bounds

StriQueL, the language presented in the previous section, is, we believe,
well-suited to “everyday” database programming and querying by end-users.
However, some more advanced applications and/or programmers may want more
powerful languages. For all semantics in the previous section, the number of
concatenations in the creation of new strings is bounded by a constant
independent of the input. (Nevertheless, in our StriQueL this constant depends on
the query, and is not arbitrarily fixed for the whole language as in, e.g.,
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$.) It is clear that this
constraint should be relaxed for certain situations.

Several possibilities thus arise for the domain, for instance the following. The
domain could be $eadom^n$ (or more generally $(\Sigma|_I)^n \subseteq \Sigma^n$),
for some n depending on I, e.g. $n = |I|$, or $n = \max\{|u| : u \in I\}$, or any
arbitrary total function, or family of functions (for instance, $n = |I|$ being
fixed, $|I|^k$ polynomial [...]). One of the more naive solutions yielding this
might be, for instance, the following: given I and $n = |I|$, take $eadom^n$ as
the domain.

In this section, we provide a more powerful language, again a variant of
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$. This language comes with a
constructive semantics, i.e., an algorithm to compute the output of a query.

As we discussed above, such languages may range from ones capturing all
string queries, down to while/fixpoint-like ones, or even simpler ones. We
follow here again our “bottom-up” exploration and provide a rather simple one,
though much more powerful than our previous StriQueL. Moreover, to explore
a different direction than in previous sections, we introduce a slight variant in
our semantics. We believe this variant also to be very promising for future work.

The language $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$. Until now,
we proceeded as follows. Given the alphabet $\Sigma$, the language
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$ over $\Sigma$ was formally
defined, allowing constants from $\Sigma$ in the formulas (and in their inputs),
and with quantifiers ranging over $\Sigma^*$. According to its definition, this
language was given as input instances over $\Sigma^*$. Let now some new symbol not
in $\Sigma$, say #, be given. Consider $\mathrm{FO}(\bullet)$ over
$\Sigma \cup \{\#\}$, but now, the input of a formula is an instance over
$\Sigma^*$ only, that is, an instance in which # does not appear. We denote this
language $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$.

The language $\mathrm{FO}_{\Gamma}(\bullet)$. What we shall do now is to define a
simple syntactic extension of $\mathrm{FO}_{rr}(\bullet)$ (expressing a subset of
$\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$), in which we allow
particular subformulas having the property that a few designated quantified
variables are allowed to range arbitrarily over $(\Sigma \cup \{\#\})^*$. We show
however (in the same spirit as for $\mathrm{FO}_{rr}(\bullet)$), based on the
particular structure of these subformulas, that, given I, only a finite number of
valuations can satisfy the formula $\varphi$ defining the query, thus yielding a
finite evaluation and a finite output.

To simplify the presentation of the syntax, we use a “macro” that we call
$\Gamma$. More precisely, a formula in the language may now contain as a
subformula an expression $\Gamma(\cdots)$. These subformulas are added to
$\mathrm{FO}_{rr}(\bullet)$ to yield the language $\mathrm{FO}_{\Gamma}(\bullet)$.
Note that, again, as for $\mathrm{FO}_{rr}(\bullet)$, the formal semantics is
simply that of $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$ (i.e., with
the full domain $(\Sigma \cup \{\#\})^*$). Formally, the $\Gamma(\cdots)$
expressions are replaced (“expanded”) by their corresponding
$\mathrm{FO}^{\#}(\bullet)$ subformula, so that strictly speaking the syntax of
$\mathrm{FO}_{\Gamma}(\bullet)$ is a fragment of $\mathrm{FO}^{\#}(\bullet)$.

Before introducing the formal details of the $\Gamma$ operator, let us present its
principles intuitively. In an expression $\Gamma(x, y, \mathit{one\_step})$, x and
y are strings, and one_step is an ordinary first order formula. This formula
one_step defines a relationship between x and y that will have to be satisfied.
With these three arguments, $\Gamma$ will be doing the following.

First, a string $x \# s_1 \# s_2 \# \ldots \# s_n \# y$ is built (call it z in
the following).

Then, it is verified whether $\mathit{one\_step}(x, s_1)$ holds, then
$\mathit{one\_step}(s_1, s_2)$, ..., until $\mathit{one\_step}(s_n, y)$. If all
are true, then the formula $\Gamma(x, y, \mathit{one\_step})$ is true.

Now, how is it that the number of such “big” strings z, and the evaluation
of the $\mathit{one\_step}(s_i, s_{i+1})$’s, are both bounded? Intuitively, to
achieve this, $\Gamma$ generates only $s_i$’s of strictly increasing size (thus
bounded by |y|), and also forces one_step to be finitely evaluable (in fact, it
even has to be quantifier-free).

Strictly speaking, this mechanism is encoded in a big “parameterized” first
order formula, that we call $\Gamma$. By “parameterized”, we simply mean that some
subformula in $\Gamma(\cdots)$ will be the “parameter”
$\mathit{one\_step}(\cdots)$. (This kind of construction is analogous to axiom
schemata in axiomatic logic.) For the sake of the presentation, we choose a very
simple variant of $\Gamma$. Richer variants can easily be defined.

We define a formula schema $\Gamma(x, y, \mathit{one\_step})$, where one_step has
two free variables, and has to be built using only conjunctions and disjunctions
of equality atoms (no negations, and no quantifiers),³ and the free variables of
$\Gamma$ are x, y. The variable x represents the “initial step” and y the “final
step” in $\Gamma$. Examples of how to use $\Gamma$ to build an
$\mathrm{FO}_{\Gamma}(\bullet)$ formula are given below.

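\[
\begin{array}{l}
\Gamma(x, y, \mathit{one\_step}) \equiv \\
\quad \mathit{no}_{\#}(x) \wedge \mathit{no}_{\#}(y) \wedge \\
\quad \exists z, u : (z = \# \bullet x \bullet \# \bullet u \bullet \# \bullet y \bullet \#) \wedge \\
\quad \forall z_1, v, w : (\mathit{substring}(z_1, z) \wedge z_1 = \# \bullet v \bullet \# \bullet w \bullet \# \wedge \mathit{no}_{\#}(v) \wedge \mathit{no}_{\#}(w)) \rightarrow \\
\quad\quad (\exists u_1, u_2\, (w = u_1 \bullet v \bullet u_2 \wedge (u_1 \neq \varepsilon \vee u_2 \neq \varepsilon)) \wedge \mathit{one\_step}(v, w))
\end{array}
\]

Let $\varphi_{\Gamma}$ denote this $\mathrm{FO}^{\#}(\bullet)$ formula, and call
it the “expansion” of our $\Gamma(\cdots)$ subformula shorthand. In the expansion
the predicate $\mathit{no}_{\#}$ checks easily that its argument does not feature
the # symbol, and substring is the obvious abbreviation. The beginning of the last
line says that at each step, each v is a strict substring of w.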
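The way such a derivation can be searched for is sketched below. This is a minimal sketch following the intuitive description rather than the exact expansion: it assumes that a derivation consists of one or more steps from x to y (how many intermediate strings are required at the boundary is an assumption of the sketch), it takes one_step as an arbitrary Python predicate instead of a formula built from equality atoms, and the names `gamma`, `strict_substrings` and `append_letter` are hypothetical. Because every string in a derivation is a strict substring of the next one, only substrings of y need to be examined.

```python
from functools import lru_cache


def strict_substrings(w):
    """All substrings of w that are strictly shorter than w (including "")."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)} - {w}


def gamma(x, y, one_step):
    """True if some chain x -> s_1 -> ... -> y exists where each string is a strict
    substring of the next one and one_step holds for each consecutive pair."""
    if "#" in x or "#" in y:            # no_#(x) and no_#(y)
        return False

    @lru_cache(maxsize=None)
    def reachable(w):
        # Some chain of at least one step leads from x to w.
        return any(one_step(v, w) and (v == x or reachable(v))
                   for v in strict_substrings(w))

    return reachable(y)


def append_letter(v, w):
    """Sample step: w is v followed by exactly one letter."""
    return len(w) == len(v) + 1 and w.startswith(v)


print(gamma("a", "abc", append_letter))   # True:  a -> ab -> abc
print(gamma("b", "abc", append_letter))   # False: "b" is not a prefix of "abc"
```

The recursion terminates because each call is made on a strictly shorter string, which mirrors the size argument used to bound the $s_i$'s.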

We extend the definition of range restriction to $\Gamma$, by adding one case to
Definition 3 in Section 3.2. Notice that, with respect to range restriction,
$\Gamma$ is a “propagator” of range restriction, not a “generator”. That is, it
does not bound variables by itself, but it transmits bounds from y to x.

If $\varphi = \varphi_1 \wedge \Gamma(x, y, \mathit{one\_step})$ then

   if $y \notin rr(\varphi_1)$, then $rr(\varphi) = rr(\varphi_1)$;

   if $y \in rr(\varphi_1)$, then $rr(\varphi) = rr(\varphi_1) \cup \{x\}$.

The language obtained is called $\mathrm{FO}_{\Gamma}(\bullet)$. Its syntax is the
syntax of $\mathrm{FO}(\bullet)$ extended with the $\Gamma$ operator, and its
semantics is that of $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$ when
$\Gamma$ is expanded.⁴

We give below the queries same_length and parity in
$\mathrm{FO}_{\Gamma}(\bullet)$. Note that strictly speaking, the symbol ¢ has to
be simulated using only #, making the real formula heavier.⁵

\[
\mathit{same\_length}(s_1, s_2) \equiv x = \mathit{car}(s_1) \bullet ¢ \bullet \mathit{car}(s_2)
\wedge y = s_1 \bullet ¢ \bullet s_2 \wedge \Gamma(x, y, \psi),
\]
\[
\text{where } \psi(v, w) \equiv v = u_1 \bullet ¢ \bullet u_2 \wedge w = u_3 \bullet ¢ \bullet u_4 \wedge
\Big(\bigvee_{a \in \Sigma} u_3 = u_1 \bullet a\Big) \wedge \Big(\bigvee_{a \in \Sigma} u_4 = u_2 \bullet a\Big),
\]

and the function car returns the first letter of its argument.

\[
\mathit{parity}(x) \equiv \Gamma(\varepsilon, x, \psi), \text{ where }
\psi(v, w) \equiv \bigvee_{\{a, b\} \subseteq \Sigma} w = v \bullet a \bullet b.
\]

³ We could also for instance define it such that all new variables introduced by
one_step be forced to be substrings of z or y. Also, one_step could be allowed to
be any safe formula, e.g. range-restricted.

⁴ To be more precise, a formula of $\mathrm{FO}_{\Gamma}(\bullet)$ is one of
$\mathrm{FO}^{\#}(\bullet)$: all $\Gamma(\ldots)$ subformulas used above as
shorthands are replaced by the actual $\varphi_{\Gamma}(\ldots)$ subformula. For
instance our notation $\exists x, y : R(x, y) \wedge \Gamma(x, y, \psi)$ denotes
the formula $\exists x, y : R(x, y) \wedge \varphi_{\Gamma}(x, y, \psi)$.

⁵ This is done essentially by using two # symbols concatenated.
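For the parity query, the “big” string z can be written down explicitly. The sketch below is an illustration under assumptions of our own: the step relation is read as "w is v followed by two letters", a derivation is taken to need at least one step (so the empty string is rejected, a boundary choice of this sketch), and the names `psi` and `parity_witness` are hypothetical.

```python
def psi(v, w):
    """One step of the parity query: w = v . a . b for some letters a and b."""
    return len(w) == len(v) + 2 and w.startswith(v)


def parity_witness(x):
    """Return the string z = #s_0#s_1#...#s_n# with s_0 = "" and s_n = x that
    witnesses parity(x), or None when no such derivation exists."""
    if "#" in x or x == "" or len(x) % 2 != 0:
        return None
    # The only possible derivation grows x two letters at a time.
    steps = [x[:n] for n in range(0, len(x) + 1, 2)]
    assert all(psi(v, w) for v, w in zip(steps, steps[1:]))
    return "#" + "#".join(steps) + "#"


print(parity_witness("abcd"))   # ##ab#abcd#
print(parity_witness("abc"))    # None
```

The witness makes visible why the marker symbol helps: the whole derivation is carried inside a single string over the extended alphabet.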



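The following theorem characterises $\mathrm{FO}_{\Gamma}(\bullet)$.

Theorem 10 Let $\varphi$ be a formula in $\mathrm{FO}_{\Gamma}(\bullet)$. Then

1. $\llbracket \varphi \rrbracket_{nat}$ is a string query.

2. For any instance I, $\llbracket \varphi \rrbracket_{nat}(I)$ can be evaluated in
space polynomial in the size of the longest string in the extended active domain
$eadom(\varphi, I)$.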

Crux. We briefly consider point 1. Let s be the longest string in
$eadom(\varphi, I)$. Let $\varphi$ be of the form
$\psi(\ldots, y, \ldots) \wedge \Gamma(x, y, \mathit{one\_step})$, where
$\psi(\ldots, y, \ldots)$ is an $\mathrm{FO}_{rr}(\bullet)$ formula.⁶ As in the
proof of Theorem 8 we will get a bound, say k, on the variable y. Then, we have
that $\llbracket \varphi \rrbracket_{nat}(I) \subseteq eadom^n$, where
$n \leq 1 + 2 + \ldots + |s|^k$. This is because of the following: The only
“dangerous” variables are the quantified ones in the expansion of $\Gamma$. Now y
is the “output” of $\Gamma$, i.e., the last word concatenated in the
existentially quantified long derivation
$\exists z : z = \# \bullet x \bullet \# \bullet s_1 \bullet \# \bullet s_2 \bullet \# \cdots \# \bullet y \bullet \#$.
As $|s_1| < |s_2| < \ldots < |y|$ (because of the constraint imposed by $\Gamma$
on one_step), we get the maximal size of n for any string in the range of
$\llbracket \varphi \rrbracket_{nat}(I)$. This shows that if y is bound to
$eadom^k$ for some constant k depending only on $\varphi$, then $\Gamma$ is safe
because quantifiers need only to range over $eadom^n$.

Note that, once replaced by its expansion, $\Gamma$ may not be range restricted.
However, we just showed above its actual safety (provided y is itself range
restricted). In other words, we mix here a syntactic means (range restriction),
and a semantically safe operator ($\Gamma$).

5 Expressive Power — StriQueL vs Others 

In this section, we compare our languages with formalisms outside string query
languages. We first compare the pure relational fragment of our languages with
pure relational FO. We end this section with a note on the difficulty of
comparing $\mathrm{FO}(\bullet)$ languages with formal languages. We however
believe that such comparisons are a promising direction of research.

String genericity vs relational genericity. The issue here is to determine

(1) whether our string languages express more queries than relational FO, and

(2) whether we can compute purely relational generic queries that are not in
relational FO (recall Definition 1 and definitions of both genericity concepts in
Section 2.1). The answer to question (1) is yes: we already saw in Section 2.1
that there are string generic mappings that are not relational generic. In addition
we now have that such mappings can be expressed in our simplest semantics,
namely the active domain semantics. The answer to question (2) is no. This last
result has to be put in contrast with the case of $\mathrm{FO}(<)$, for which the
order did bring new generic relational queries [16,1]. We denote by
$\downarrow M$ the subset of mappings in M that are relational generic.

⁶ Note that y might be quantified in $\psi(\ldots, y, \ldots)$.

Proposition 11 1. There is a formula $\varphi$ in $\mathrm{FO}(\bullet)$, such
that $\llbracket \varphi \rrbracket_{adom}$ is not a relational generic query.

2. $\llbracket \mathrm{FO} \rrbracket_{adom} =
\downarrow \llbracket \mathrm{FO}(\bullet) \rrbracket_{adom} =
\downarrow \llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$

Crux. Given an $\mathrm{FO}(\bullet)$ formula $\varphi$ such that
$\llbracket \varphi \rrbracket_{adom}$ is a relational generic query, we
construct an FO formula $\psi$, such that
$\llbracket \psi \rrbracket_{adom}(I) = \llbracket \varphi \rrbracket_{adom}(I)$,
for any instance I. A case analysis shows that $\psi$ and $\varphi$ agree on
instances where the active domain is composed of words of size 1. From this and
the fact that $\varphi$ is relational generic, it follows that they must also
agree on all models. This proof is inspired by techniques from [3].

On the relationship with formal languages. A string query is called context-
free if there exists a context-free language $L \subseteq \Sigma^*$ such that the
following holds: When given as input a unary relation r, the query returns
$r \cap L$. The same definition extends to other classes of languages. Here we
only give a conjecture grouping several representative situations.
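As a small illustration of this definition, a query induced by a language simply filters the input relation by membership. The sketch below uses the textbook context-free language $\{a^n b^n : n \geq 0\}$ as a hypothetical example; the language and the names `in_anbn` and `language_query` are ours and do not come from the paper.

```python
def in_anbn(s):
    """Membership test for the context-free language { a^n b^n : n >= 0 }."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s == "a" * n + "b" * n


def language_query(r, member):
    """The query induced by a language L: on a unary relation r, return r intersected with L."""
    return {s for s in r if member(s)}


r = {"ab", "aabb", "aba", "ba", ""}
print(sorted(language_query(r, in_anbn)))   # ['', 'aabb', 'ab']
```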

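Conjecture 1 1. The query parity(x) is rational but not in
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$.

2. The query $\varphi(x) \equiv x = a^i \bullet b^j \bullet a^k$, where a and b
are in $\Sigma$ and $i \neq j$ or $j \neq k$, is context-free but not in
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$.

Note that if we are allowed to use additional “marker-symbols,” as in the
query language $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$, then
computing the parity query is easy. These examples suggest that putting in
correspondence fragments of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$ and
classes of formal languages may not be easy.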

6 Related Works 

Strings have been studied in databases, logic, formal language theory and com- 
putational complexity. In this section, we survey the main works in relationship 
with ours. We begin with databases, then turn to the others. Some definitions 
we used were introduced by other authors, as was pointed out in the text above. 
All the theorems that were explicitly stated above are novel. 

Databases. Present-day database query languages offer little support for string 
relations. For example, the Sequence Retrieving System (SRS) [10], which has 
gained popularity in molecular biology, does not allow the database administra- 
tor to draw links from one preformatted data file to another [11], but only on its 
atomic non-sequence fields. Because the majority of current relational database
management systems do not support application-specific data types such as
sequences, some molecular biology database designers have begun to move towards
object-oriented database technology [12]. Another solution, strongly advocated 
in for instance [7], is to introduce such types as relational domains, as we have 
done. In any case, the string handling concepts introduced in this work are in no 
way specific to the relational model; indeed, they are being applied for querying 
sequences of complex objects from object-oriented databases as well [2]. 

Our language differs from such languages as SEQ [32,33], where sequence 
elements and their underlying order type are distinct domains, in that queries 
in our application areas are more oriented towards parsing-type tasks than for 
example computing moving averages on temporally sequenced data [32] . 

One of the early declarative database languages for string databases was
Richardson's [27]. This language used the modalities of temporal logic for
expressing properties of strings. Each successive position in a string is seen
as the timewise “next” instance of that string. The temporal modalities lend
themselves naturally to reasoning about strings. There are however well-known
restrictions [38] to basic temporal logic. In particular, no recursion or
iteration is achieved.

The works closer to ours begin with Wang [15], and Ginsburg and Wang [14].
They extended the relational calculus with interpreted regular transducer
functions. Such a function has n input strings, and the result is a regular
finite state transduction on the inputs. The resulting language captures all
r.e. sets (and sets in the arithmetical hierarchy) [15]. The drastic expressive
power of the language results from the use of powerful interpreted functions. Of
course, the transducer mappings of Wang and Ginsburg are intended to serve as
semantic programming primitives, but using them may turn out to be rather heavy
from the point of view of programming comfort.

Later, Grahne, Nykänen, and Ukkonen [17] proposed a string-extension of
relational calculus based on modal logic. Their language has basically the same
expressive power as that of Ginsburg and Wang. It has however the advantage
that the string primitives do not have to be “plugged in”: they have a declarative
semantics that can be coupled with relational structures. The string
programming primitive is a declarative specification of a multitape finite state
automaton. Multitape automata have proven to be useful in pattern matching
problems [13,17]. The language has also been implemented [19]. A detailed study of
safety and evaluability is done in [18].

Still later, Mecca and Bonner [24] used the interpreted functions of Ginsburg
and Wang in a datalog-like language. Since Prolog with one function symbol
already yields all r.e. sets [21], the Mecca-Bonner language has full Turing
expressivity. In a series of sophisticated constructions [24] Mecca and Bonner
restrict the language to capture sets in for instance PTIME, and hyper-exponential
time. In the Mecca-Bonner language the programming primitive for recursion is
datalog rules. String operations are expressed through the transducer functions,
and if the programmer wants to iterate or recurse over strings, she has to mix
the datalog rules and transducer functions, and stay within the syntactic
restrictions given by Bonner and Mecca. W.r.t. expressivity, Bonner and Mecca
consider computing string functions, i.e., mappings from a single relation
containing a single string, and returning a single string.

In brief, compared to [14,17,24], we go back to the spirit of the “founding
fathers”. For the string programming primitive, we return to using only
$\bullet$, simpler than all previous string programming primitives. And for the
“host language”, we return to FO, simpler than [17,24]. We study in depth a
low-complexity language, whereas the abovementioned authors explore very powerful
ones (except some PTIME fragment with transducers in [24]).

Logic, formal languages and computational complexity. The main direction
in logic seems to have been in showing undecidability of various fragments
of the theories of concatenation. In the case of first order, this amounts to
variants of our $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$. It was
initiated by Quine [26] (see also [36]). Makanin [23] shows the decidability of
the existential fragment of the theory of concatenation, with consequences on
word equations. We used the recent presentation in [8], which contains other
references; however, nothing seems to have been done w.r.t. computing the set of
all solutions [30] (not only a yes/no answer), which is the output of our queries.

From a pure formal languages point of view, Salomaa [29], Chap. III, presents
an apparently isolated result around characterising type-0 (i.e., recursively
enumerable) and type-1 languages.

In the classical marriage between strings and formal language theory [28,9], 
a model of a formula is a single string, not a whole collection of relations as in 
our study. As a consequence, issues, techniques and results are quite different. 

In computational complexity, Stockmeyer [34] shows that the polynomial 
time hierarchy can be characterised in terms of (polynomially bounded) quan- 
tification over string relations. 

In brief, in logic and related areas, to our knowledge, no study of a low power
fragment such as ours has been undertaken.

7 Conclusion 

Querying string databases has essentially focused until now on very powerful
languages, i.e., variants of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$.
Our main point in this paper was the proposal and in-depth study of a low
complexity fragment of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$. Its
simplicity being, we believe, well-suited for “everyday” database querying and
programming, we called it StriQueL, standing for String Query Language.

More precisely, after having defined string queries, given the
$\mathrm{FO}(\bullet)$ syntax, we considered the general language
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$ and defined our goal: all
string queries. Discouraged by an undecidability result we obtained, we
proceeded “bottom-up”. The following were successively introduced: semantic
restrictions of $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$
($\llbracket \mathrm{FO}(\bullet) \rrbracket_{adom}$,
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom}$, and
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{eadom^k}$); then a syntactic
restriction, $\mathrm{FO}_{rr}(\bullet)$. We argued that this last one might
deserve the name StriQueL (String Query Language). It was shown that its
complexity is in logarithmic space, and that it does not express all string
queries. To get more expressive power, $\mathrm{FO}_{\Gamma}(\bullet)$, a fragment
of $\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$, was introduced. Some
comparisons with formal languages and pure relational languages were also proven.

All languages considered (except
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$ and
$\llbracket \mathrm{FO}^{\#}(\bullet) \rrbracket_{nat}$) were shown to express
only string queries. It remains open whether there is a string query
not in $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$.

Perspectives. First, an efficient operational semantics for StriQueL was not
considered in this paper: the design of an adequate algebra is definitely relevant.

Another direction will be the design and study of a more powerful variant
of the $\Gamma$ operator. A different direction, more in the “systematic” spirit
of our StriQueL (or $\mathrm{FO}_{rr}(\bullet)$), will be around the following
notion of domain independence ($\Sigma^{\leq l}$ represents all strings of size
at most l):

   A string mapping (defined by $\varphi$ in the semantics
   $\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$) is said to be domain
   independent if, for each instance I, there exists a constant l, such that
   $\llbracket \varphi \rrbracket_{nat}(I) = \llbracket \varphi \rrbracket_{\Sigma^{\leq l}}(I)$.

Of course, one issue is to compute the constant l. In addition, some aspects
of domain independence are rather involved in the relational case [20], so here,
because of the nature of string genericity and the infinite domain $\Sigma^*$,
new techniques have to be developed.

A parallel step will be to define analogues of fixpoint and while for
strings. One of our purposes in doing the present work was to gain a deeper
understanding of the notion of string query; we believe these recursive languages
can now be addressed.

Another interesting further question is how our results correlate with query
languages for lists. Is it possible to apply some of them to lists with arbitrary
element type (typically having an infinite domain)? How can query languages
for lists be adapted for strings?

To close this perspective section, we believe the most interesting issue at this
point is this: Is there an effective syntactic restriction of
$\llbracket \mathrm{FO}(\bullet) \rrbracket_{nat}$ yielding exactly the class of
all string queries?

Abstract. Based on the recursion mechanism of the XML transformation
language XSL, the document transformation language $\mathcal{DTL}$ is
defined. First the instantiation $\mathcal{DTL}^{reg}$ is considered that uses
regular expressions as pattern language. This instantiation closely resembles the
navigation mechanism of XSL. For $\mathcal{DTL}^{reg}$ the complexity of relevant
decision problems such as termination of programs, usefulness of rules and
equivalence of selection patterns, is addressed. Next, a much more powerful
abstraction of XSL is considered that uses monadic second-order
logic formulas as pattern language ($\mathcal{DTL}^{mso}$). If
$\mathcal{DTL}^{mso}$ is restricted to top-down transformations
($\mathcal{DTL}^{mso}_d$), then a computational model can
be defined which is a natural generalization to unranked trees of top-
down tree transducers with look-ahead. The look-ahead can be realized
by a straightforward bottom-up pre-processing pass through the document.
The size of the output of an XSL program is at most exponential
in the size of the input. By restricting copying in XSL a decidable fragment
of $\mathcal{DTL}^{mso}_d$ programs is obtained which induces transformations of
linear size increase (safe $\mathcal{DTL}^{mso}_d$). It is shown that the
emptiness and finiteness problems are decidable for ranges of
$\mathcal{DTL}^{mso}_d$ programs and that the ranges are closed under intersection
with generalized Document Type Definitions (DTDs).



1 Introduction 

XSL [4,2] is a recursive XML [5,1,23] transformation language and an XSL pro- 
gram can be thought of as an ordered collection of templates. Each template 
has an associated pattern (selection pattern) and contains a nested set of con- 
struction rules. A template processes nodes that match the selection pattern 
and constructs output according to the construction rules. The transformation 
starts at the ‘root’ of the input document and the construction rules specify, by
means of construction patterns, where in the XML document the transformation
process should continue. In this paper we define a document transformation
language ($\mathcal{DTL}$) based on the recursion and navigation mechanism
embodied in XSL.

As is customary [21,16], we use an abstraction of XML documents that fo- 
cuses on the document structure and consider a document as a tree. Such a tree 
is ordered and unranked (consider, e.g., a list-tag: the number of list entries is 
unbounded, which means that the number of children in the corresponding tree is 
unbounded). Figure 1 shows an XML document and the corresponding tree. In
our notation the tree shown there is the string
product(sales(dom(a)dom(b)for(c))sales(dom(d)for(e)for(f))). To enhance the
expressiveness of Document Type Definitions (DTDs), which are modeled by
extended context-free grammars, we also use a tree notion. Indeed, we define
generalized DTDs as the tree regular grammars defined by Murata [18].



    <product>
      <sales>
        <domestic> a </domestic>
        <domestic> b </domestic>
        <foreign> c </foreign>
      </sales>
      <sales>
        <domestic> d </domestic>
        <foreign> e </foreign>
        <foreign> f </foreign>
      </sales>
    </product>

                      product
                  /            \
             sales              sales
           /   |   \          /   |   \
        dom   dom   for    dom   for   for
         |     |     |      |     |     |
         a     b     c      d     e     f



Fig. 1. Example of an XML document and the corresponding tree representation 



First, we study the instantiation $\mathcal{DTL}^{reg}$ of $\mathcal{DTL}$
that uses regular path expressions as selection and construction patterns.
This instantiation closely resembles the navigation mechanism of XSL. We
consider various decision problems for $\mathcal{DTL}^{reg}$ programs. An
important drawback of XSL is that programs do not always terminate. This is
due to the ancestor function, which allows XSL programs to move up in XML
documents. We show that it is EXPTIME-complete to decide whether or not a
$\mathcal{DTL}^{reg}$ program terminates on all trees satisfying a
generalized DTD. Further, we consider optimization problems, like usefulness
of template rules and equivalence of selection patterns.

Next, we study $\mathcal{DTL}$ with monadic second-order logic (MSO) formulas
as pattern language ($\mathcal{DTL}^{mso}$) and focus on the natural fragment
of $\mathcal{DTL}^{mso}$ that induces top-down transformations. This means
that the construction rules can only select descendants of the current node
for further processing. Consequently, programs can no longer move up in the
document and will always terminate. We denote this fragment by
$\mathcal{DTL}^{mso}_d$. We define a computational model for
$\mathcal{DTL}^{mso}_d$: the top-down tree transducer with look-ahead. This
is a finite state device obtained as the natural generalization of the usual
top-down tree transducer [24,8] over ranked trees. The basic idea of going
from ranked to unranked trees is the one of Brüggemann-Klein, Murata and
Wood [3]: replace recursive calls by regular (string) languages of recursive
calls. We show that these transducers correspond exactly to
$\mathcal{DTL}^{mso}_d$ programs. As in the ranked case, the look-ahead used
by the transducer can be eliminated by first running a bottom-up relabeling
on the input tree. This means that the input tree has to be processed only
twice in order to perform a transformation: first a bottom-up relabeling
phase and then the top-down transformation (without look-ahead).

Unfortunately, the ranges of programs can in general not be de- 
scribed by (generalized) DTDs. For a given (or program this is 

even undecidable. We show that two relevant optimization problems for
$\mathcal{DTL}^{mso}_d$ programs are decidable: (1) whether or not the range
of a $\mathcal{DTL}^{mso}_d$ program is empty, and (2) whether or not the
range of a $\mathcal{DTL}^{mso}_d$ program is finite. Further, we show that
the class of output languages of $\mathcal{DTL}^{mso}_d$ programs is closed
under intersection with generalized DTDs: given a $\mathcal{DTL}^{mso}_d$
program $P$ and a generalized DTD $D$, there always exists a
$\mathcal{DTL}^{mso}_d$ program $P'$ that only transforms an input tree when
the result belongs to $D$.

Since XSL programs can select nodes of an input tree several times (“copy”),
the size of the output trees can be exponential in the size of the input trees.
However, the original purpose of XSL was to add style specifications to XML
documents.^1 Most XSL programs, therefore, do not change the input document
very drastically. Hence, it makes sense to focus on transformations where the
size increase is only linear. An obvious, but rather drastic way to obtain this
is to simply disallow copying of subtrees. From a practical viewpoint, however,
it is desirable to allow (some restricted type of) copying (see the example in
Section 7.2). We define a dynamic notion that essentially requires that each
subtree can only be processed (and hence copied) a bounded number of times.
We call $\mathcal{DTL}^{mso}_d$ programs that are bounded copying safe.
Consequently, a safe program runs in time linear in the size of the input
tree. Although safeness is a dynamic notion, we show that it is nevertheless
decidable.

2 Preliminaries 

2.1 Trees and Forests 

For $k \in \mathbb{N}$, $[k]$ denotes the set $\{1, \dots, k\}$; thus
$[0] = \emptyset$. We denote the empty string by $\varepsilon$. In what
follows let $\Sigma$ be an alphabet. The set of all (resp. nonempty) strings
over $\Sigma$ is denoted by $\Sigma^*$ ($\Sigma^+$, respectively). Note that
$*$ and $+$ are also used in regular expressions (see the examples in
Section 2.2). For a set $S$ we denote the set of all regular languages over
$S$ by $\mathrm{Reg}(S)$. For a string $w = a_1 \cdots a_n$ with
$a_1, \dots, a_n \in \Sigma$ and $i \in [n]$, we denote by $w(i)$ the $i$-th
letter $a_i$.

The set of unranked trees over $\Sigma$, denoted by $\mathcal{T}_\Sigma$, is
the smallest set of strings $T$ over $\Sigma$ and the parenthesis symbols ‘(’
and ‘)’ such that for $\sigma \in \Sigma$ and $w \in T^*$, $\sigma(w)$ is in
$T$. For $\sigma()$ we simply write $\sigma$. In the following, when we say
tree, we always mean unranked tree. Let $S$ be a set. Then
$\mathcal{T}_\Sigma(S)$ denotes the set of trees $t$ over $\Sigma \cup S$
such that symbols of $S$ may only appear at the leaves of $t$. The set
$\mathcal{T}_\Sigma^*$ of unranked forests over $\Sigma$ is denoted by
$\mathcal{F}_\Sigma$; furthermore,
$\mathcal{F}_\Sigma(S) = \mathcal{T}_\Sigma(S)^*$.

^1 There are now new proposals to make XSL into a fully fledged XML query
language, see, e.g., the proposal by Bosworth [2].

For every tree $t \in \mathcal{T}_\Sigma$, the set of occurrences (or, nodes)
of $t$, denoted by $\mathrm{Occ}(t)$, is the subset of $\mathbb{N}^*$
inductively defined as: if $t = \sigma(t_1 \cdots t_n)$ with
$\sigma \in \Sigma$, $n \geq 0$, and $t_1, \dots, t_n \in \mathcal{T}_\Sigma$,
then $\mathrm{Occ}(t) = \{\varepsilon\} \cup \{iu \mid i \in [n], u \in \mathrm{Occ}(t_i)\}$.

Thus, the occurrence $\varepsilon$ represents the root of a tree and $ui$
represents the $i$-th child of $u$. For every tree $t \in \mathcal{T}_\Sigma$
and every occurrence $u$ of $t$, the label of $t$ at occurrence $u$ is
denoted by $t[u]$; we also say that $t[u]$ occurs in $t$ at node $u$. Define
$\mathrm{rank}(t, u) = n$, where $n$ is the number of children of $u$. The
subtree of $t$ at occurrence $u$ is denoted by $t/u$. The substitution of a
forest $w \in \mathcal{F}_\Sigma$ at occurrence $u$ in $t$ is denoted by
$t[u \leftarrow w]$. Formally, these notions can be defined as follows:
$t[\varepsilon]$ is the first symbol of $t$ (in $\Sigma$),
$t/\varepsilon = t$, $t[\varepsilon \leftarrow w] = w$, and if
$t = \sigma(t_1 \cdots t_k)$, $i \in [k]$, and $u \in \mathrm{Occ}(t_i)$, then
$t[iu] = t_i[u]$, $t/iu = t_i/u$, and
$t[iu \leftarrow w] = \sigma(t_1 \cdots t_i[u \leftarrow w] \cdots t_k)$.
Note that $t[u \leftarrow w]$ is in general a forest (for
$u \neq \varepsilon$ it is a tree).
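The notions of occurrence, label, subtree and substitution can be made concrete with a small sketch. The encoding below is an assumption made for illustration only: a tree is a pair (label, list of children), an occurrence is a tuple of 1-based child indices with the empty tuple standing for $\varepsilon$, and the names `occ`, `label`, `subtree`, `rank`, `substitute` and `FIG1` are hypothetical.

```python
from typing import List, Tuple

Tree = Tuple[str, list]      # (label, list of child trees); an assumed encoding
Forest = List[Tree]
Node = Tuple[int, ...]       # occurrence as 1-based child indices; () is epsilon


def occ(t: Tree) -> List[Node]:
    """All occurrences of t, listed in pre-order."""
    out: List[Node] = [()]
    for i, kid in enumerate(t[1], start=1):
        out.extend((i,) + u for u in occ(kid))
    return out


def subtree(t: Tree, u: Node) -> Tree:
    """t/u: the subtree of t rooted at occurrence u."""
    for i in u:
        t = t[1][i - 1]
    return t


def label(t: Tree, u: Node) -> str:
    """t[u]: the label of t at occurrence u."""
    return subtree(t, u)[0]


def rank(t: Tree, u: Node) -> int:
    """Number of children of occurrence u."""
    return len(subtree(t, u)[1])


def substitute(t: Tree, u: Node, w: Forest) -> Forest:
    """t[u <- w]: replace the subtree at u by the forest w; the result is a forest."""
    if u == ():
        return list(w)
    lab, kids = t
    i = u[0]
    new_kids = kids[:i - 1] + substitute(kids[i - 1], u[1:], w) + kids[i:]
    return [(lab, new_kids)]


def leaf(a: str) -> Tree:
    return (a, [])


# The tree of Figure 1 in this encoding.
FIG1: Tree = ("product", [
    ("sales", [("dom", [leaf("a")]), ("dom", [leaf("b")]), ("for", [leaf("c")])]),
    ("sales", [("dom", [leaf("d")]), ("for", [leaf("e")]), ("for", [leaf("f")])]),
])
```

Substituting at the root returns the forest itself, while substituting below the root always yields a one-tree forest, which mirrors the remark that $t[u \leftarrow w]$ is a tree for $u \neq \varepsilon$.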

2.2 DTDs and Generalized DTDs 

We model a DTD [5] as an extended context-free grammar. This is a context-free
grammar that allows regular expressions on the right-hand side of productions.
To illustrate the shortcomings of DTDs we recall the example from Ludäscher
et al. [16,21]. Consider the following DTD G

    dealers → dealer*
    dealer  → ad*
    ad      → usedcar_ad + newcar_ad

that models a list of dealers with advertisements for new and used cars. Note
that the right-hand sides are regular expressions, i.e., * and + mean
Kleene-star and set union, respectively. We now want to specify those
derivation trees of the above grammar where each dealer has at least one used
car ad. This cannot be specified with DTDs without changing the structure of
the derivation trees. Therefore, we will define generalized DTDs as the tree
regular grammars introduced by Murata [18].

Definition 1. A generalized DTD is a tree regular grammar
$D = (N, \Delta, S, P)$, where $N$ and $\Delta$ are alphabets of nonterminals
and terminals, respectively, $S \in N$ is the start nonterminal, and $P$ is a
finite set of productions of the form $A \to t$, where
$t \in \mathcal{F}_\Delta(\mathrm{Reg}(N))$, and if $A = S$, then $t$ should
be a tree with $t[\varepsilon] \notin \mathrm{Reg}(N)$.^2

The language generated by $D$, denoted by $L(D)$, is defined in the obvious
way: $L(D) = \{t \in \mathcal{T}_\Delta \mid S \Rightarrow_D^* t\}$, where the
derivation relation induced by $D$ is defined as
$\xi_1 \Rightarrow_D \xi_2$ if there is a node $u$ in $\xi_1$ labeled by a
regular language $K$ and $\xi_2$ is obtained from $\xi_1$ by substituting $u$
by the right-hand sides of $X_1$-, ..., $X_n$-productions, where
$X_1 \cdots X_n$ is a string in $K$. Here, an $X$-production is a production
of the form $X \to t$.

^2 This is just a technicality to ensure the definition of trees only.

Example 2. The following generalized DTD defines those derivation trees of G
where all dealers have at least one used car ad. All strings that start with
capital letters are nonterminals; all others are terminals; Dealers is the
start symbol. For convenience, we denote the regular languages at leaves by
regular expressions.

    Dealers → dealers(D)
    Dealer  → dealer(U)
    UsedAd  → ad(usedcar_ad)
    NewAd   → ad(newcar_ad),

where D and U are the regular languages given by the expressions Dealer* and
(UsedAd + NewAd)*UsedAd(UsedAd + NewAd)*, respectively.
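Membership in the language of Example 2 can be checked bottom-up, as the following sketch suggests. It rests on several assumptions that are not part of the definition: every nonterminal is encoded by a single letter, each production has the shape X → label(regular expression over child nonterminals), and two helper nonterminals `u` and `n` are introduced for the terminal leaves usedcar_ad and newcar_ad. The brute-force search over child assignments is exponential and only meant to illustrate the semantics.

```python
import re
from itertools import product

# Assumed one-letter encoding of the grammar of Example 2:
#   S = Dealers, D = Dealer, U = UsedAd, N = NewAd,
#   u / n = helper nonterminals for the leaves usedcar_ad / newcar_ad.
PRODUCTIONS = {
    "S": ("dealers", r"D*"),
    "D": ("dealer", r"[UN]*U[UN]*"),   # at least one UsedAd among the ads
    "U": ("ad", r"u"),
    "N": ("ad", r"n"),
    "u": ("usedcar_ad", r""),
    "n": ("newcar_ad", r""),
}


def derivable(tree):
    """Set of nonterminals X such that X derives the given tree.

    A tree is a pair (label, list of child trees).
    """
    lab, kids = tree
    kid_sets = [derivable(k) for k in kids]
    result = set()
    for x, (terminal, regex) in PRODUCTIONS.items():
        if terminal != lab:
            continue
        # Try every way of choosing one nonterminal per child.
        if any(re.fullmatch(regex, "".join(choice))
               for choice in product(*kid_sets)):
            result.add(x)
    return result


def in_language(tree):
    """True iff the start nonterminal derives the tree."""
    return "S" in derivable(tree)


USED = ("ad", [("usedcar_ad", [])])
NEW = ("ad", [("newcar_ad", [])])
GOOD = ("dealers", [("dealer", [NEW, USED]), ("dealer", [USED])])
BAD = ("dealers", [("dealer", [NEW])])
```

Here `GOOD` is accepted because every dealer carries a used car ad, whereas `BAD` is rejected although it is a derivation tree of the plain DTD G.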

Generalized DTDs have the same expressive power as tree automata on unranked
trees [3], which are essentially the specialized DTDs of Papakonstantinou and
Vianu [21,3].



3 XSL 

In this section we give some examples of XSL programs, which will motivate the
definition of $\mathcal{DTL}$. XSL programs contain more features than we
describe here.^3 We will focus on the navigational and restructuring ability
of XSL.

Example 3. In Figure 2 an example of an XSL program P is shown. Figure 3
contains the output produced by processing the XML document in Figure 1.
The program P contains three templates. A template consists of a selection
pattern, which equals the match attribute, and of construction rules, which equal
anything between <xsl:template ...> and </xsl:template>. The translation
process starts at the root of the document. The selection patterns determine
which template should be applied at the current node. The construction rules
describe the output and contain construction patterns. Construction patterns
are the patterns that equal the select attribute in xsl:apply-templates; they
select the nodes with which the transformation process should continue. If no
construction pattern appears in a template, then all children of the current node
are processed.

In P the templates are applied at product, foreign and domestic nodes,
respectively. In patterns, the construct / denotes child of and // denotes
descendant of. The pattern sales/domestic then selects all domestic
grandchildren of the current node whose parent is labeled with sales, and the
pattern

^3 Although the XSL working draft [4] is still unstable and sometimes remains
quite vague.
    <xsl:stylesheet>
      <xsl:template match="product">
        <OUT>
          <TABLE>
            <xsl:apply-templates select="sales//foreign"/>
          </TABLE>
          <TABLE>
            <xsl:apply-templates select="sales/domestic"/>
          </TABLE>
        </OUT>
      </xsl:template>
      <xsl:template match="foreign">
        ‘<xsl:apply-templates/>’
      </xsl:template>
      <xsl:template match="domestic">
        <xsl:apply-templates/>
      </xsl:template>
    </xsl:stylesheet>

Fig. 2. Example of an XSL program 

sales//foreign selects all descendants of sales-labeled children of the current
node which are labeled foreign. If several nodes are selected by a pattern, then
they are processed in document order [5], which is the pre-order of the
document (tree). The construction rules of the first template create two
‘tables’ and put them between OUT ‘nodes’; in the first table all nodes that
match the pattern sales//foreign are selected for further processing; in the
second table all nodes that match the pattern sales/domestic are selected for
further processing. Built-in template rules make sure that text nodes (like
a, b, ... in Figure 1) are copied through. The second template rule puts them
between quotes, the third one does not.

XSL contains built-in template rules to allow recursive processing to continue in 
the absence of a successful pattern match by an explicit rule in the style sheet. 
In our language $\mathcal{DTL}$ we will not consider built-in rules as they
can easily be simulated. Consequently, $\mathcal{DTL}$ programs will not
transform every input tree. We will show (in the proof of Theorem 18) that
the domain of a $\mathcal{DTL}$ program can be defined by a generalized DTD
(more precisely, we show this for the ‘descendant’ case with MSO patterns,
$\mathcal{DTL}^{mso}_d$).

XSL essentially allows arbitrary regular expressions as selection and con- 
struction patterns. As is illustrated by the next example, construction patterns 
can also select ancestors as opposed to descendants. 

    <OUT>
      <TABLE> ‘c’ ‘e’ ‘f’ </TABLE>
      <TABLE> a b d </TABLE>
    </OUT>

Fig. 3. The output of the program in Figure 2 on the data in Figure 1

Example 4. The function ancestor(p) selects the first ancestor of the current
node that matches pattern p. For example, ancestor(chapter)/title will select
the title children of the first ancestor of the current node that is a chapter.
This feature can cause undesirable behavior. Indeed, in Figure 4 an XSL program
is shown that does not terminate on the XML document in Figure 1.

    <xsl:stylesheet>
      <xsl:template match="product">
        <TABLE>
          <xsl:apply-templates select="ancestor(product)"/>
        </TABLE>
      </xsl:template>
    </xsl:stylesheet>

Fig. 4. Example of an XSL program that does not terminate 

If a node matches several template rules, then the rule with the highest priority 
is taken. The priority of a template rule is specified by the priority attribute 
of the rule. 

The XSL working draft informally mentions mode attributes, which allow the
same parts of the document to be treated in different ways. A simple example
which needs modes is the transformation of a list of items into two lists of
corresponding serial numbers and prices (see Section 7.2). In $\mathcal{DTL}$
we model this by states.



4 The Document Transformation Language $\mathcal{DTL}$

We now define $\mathcal{DTL}$ without specifying the actual pattern language.
Intuitively, a unary pattern selects nodes of trees, while a binary pattern
selects pairs of nodes of trees. Formally, a unary pattern $p$ over $\Sigma$
is a subset of $\mathcal{T}_\Sigma \times \mathbb{N}^*$; a binary pattern
$p'$ over $\Sigma$ is a subset of
$\mathcal{T}_\Sigma \times \mathbb{N}^* \times \mathbb{N}^*$. Let
$s \in \mathcal{T}_\Sigma$ and let $u, v \in \mathrm{Occ}(s)$ be nodes of $s$.
If $(s, u) \in p$ (respectively, $(s, u, v) \in p'$), then we say that $u$
matches $p$ (respectively, $(u, v)$ matches $p'$). Let $Q$ be a finite set of
states. A construction function $f$ over $Q$ and $\Sigma$ is a function from
$Q$ to the set of binary patterns, such that
$\forall q, q' \in Q: q \neq q' \Rightarrow f(q) \cap f(q') = \emptyset$; this
condition expresses that in a construction function all binary patterns should
be disjoint. The set of all construction functions over $Q$ and $\Sigma$ is
denoted by $\mathrm{CF}(Q, \Sigma)$.

Definition 5. A $\mathcal{DTL}$ program is a tuple
$P = (\Sigma, \Delta, Q, q_0, R, \prec)$, where

- $\Sigma$ is an alphabet of input symbols;

- $\Delta$ is an alphabet of output symbols;

- $Q$ is a finite set of states (modes);

- $q_0 \in Q$ is the initial state;

- $R$ is a finite set of template rules of the form $(q, p, t)$ where
  $q \in Q$, $p$ is a unary pattern over $\Sigma$ (called selection pattern)
  and $t$ is a forest in $\mathcal{F}_\Delta(\mathrm{CF}(Q, \Sigma))$; if
  $q = q_0$, then $t$ is required to be a tree such that
  $t[\varepsilon] \notin \mathrm{CF}(Q, \Sigma)$;^4

- $\prec$ is a total order on $R$, called the priority order.

We are now ready to define the transformation relation induced by $P$.
Intuitively, $P$ starts processing in its initial state $q_0$ at the root node
$\varepsilon$ of the input tree $s$. This is denoted by $q_0(\varepsilon)$.
Now the highest priority template rule $(q_0, p, t)$ for which $\varepsilon$
matches $p$ is applied. This means to replace $q_0(\varepsilon)$ by $t$, in
which each construction function $f$ is replaced by a sequence
$q_1(v_1) \cdots q_m(v_m)$, where each $v_i$ is a node of $s$ selected by the
pattern $f(q_i)$, i.e., $(s, \varepsilon, v_i) \in f(q_i)$, and
$v_1, \dots, v_m$ are in pre-order. The transformation process then continues
in the same manner at these nodes.

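Formally, the transformation relation induced by $P$ on $s$, denoted by
$\Rightarrow_{P,s}$, is the binary relation on
$\mathcal{T}_{\Delta \cup Q}(\mathrm{Occ}(s))$ defined as follows. For
$\xi, \xi' \in \mathcal{T}_{\Delta \cup Q}(\mathrm{Occ}(s))$,
$\xi \Rightarrow_{P,s} \xi'$ if there is a node $u \in \mathrm{Occ}(\xi)$ and
a template rule $r = (q, p, t)$ in $R$ such that

1. $\xi/u = q(v)$ with $q \in Q$ and $v \in \mathrm{Occ}(s)$,

2. $r = \max_{\prec}\{(q, p', t') \in R \mid (s, v) \in p'\}$,

3. $\xi' = \xi[u \leftarrow t\Theta]$, where $\Theta$ denotes the substitution
   of replacing every construction function $f \in \mathrm{CF}(Q, \Sigma)$ by
   the forest $q_1(v_1) \cdots q_m(v_m)$, where

   - $\{v_1, \dots, v_m\} = \{u \mid (s, v, u) \in \bigcup_{q \in Q} f(q)\}$;

   - for $i \in [m]$, $(s, v, v_i) \in f(q_i)$; and

   - $v_1 <_{pre} \cdots <_{pre} v_m$. Here, $<_{pre}$ denotes the pre-order
     of the tree.

The transformation realized by $P$, denoted by $\tau_P$, is the function
$\{(s, t) \in \mathcal{T}_\Sigma \times \mathcal{T}_\Delta \mid q_0(\varepsilon) \Rightarrow_{P,s}^* t\}$.
Here, $\Rightarrow_{P,s}^*$ denotes the transitive closure of
$\Rightarrow_{P,s}$. We give an example of a $\mathcal{DTL}$-transformation in
the next section.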
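The transformation relation can be read as a recursive procedure, sketched below under simplifying assumptions. Patterns are modeled as arbitrary Python predicates rather than a concrete pattern language, rules are listed from highest to lowest priority, a construction function is a dictionary from states to binary predicates, and trees use the pair encoding (label, list of children) with occurrences as tuples of 1-based child indices. All names (`run`, `expand`, `build`, `RULES`) are hypothetical, and the procedure need not terminate when patterns can move upwards.

```python
def occ(t):
    """Occurrences of t in pre-order; () is the root."""
    out = [()]
    for i, kid in enumerate(t[1], start=1):
        out.extend((i,) + u for u in occ(kid))
    return out


def run(rules, q0, s):
    """Apply a program to the input tree s and return the output forest.

    rules: list of (state, unary_predicate, rhs_forest), highest priority first.
    An item of a right-hand side is either a tree (label, children) or a dict
    {state: binary_predicate} playing the role of a construction function.
    """
    nodes = occ(s)  # pre-order, so selected nodes are emitted in pre-order

    def expand(q, v):
        # Rewrite q(v) with the highest-priority matching rule for state q.
        for state, select, rhs in rules:
            if state == q and select(s, v):
                return build(rhs, v)
        raise ValueError(f"no rule applies for state {q!r} at node {v}")

    def build(forest, v):
        out = []
        for item in forest:
            if isinstance(item, dict):
                # Replace f by q1(v1)...qm(vm); patterns are assumed disjoint.
                for u in nodes:
                    for q, pattern in item.items():
                        if pattern(s, v, u):
                            out.extend(expand(q, u))
            else:
                lab, kids = item
                out.append((lab, build(kids, v)))
        return out

    return expand(q0, ())


def proper_descendant(s, v, u):
    return len(u) > len(v) and u[:len(v)] == v


# A one-rule program: output a node labeled "a" whose children are the
# results of processing all proper descendants of the current node.
RULES = [("q0", lambda s, v: True, [("a", [{"q0": proper_descendant}])])]

MONADIC = ("a", [("a", [("a", [("a", [])])])])
RESULT = run(RULES, "q0", MONADIC)
```

On the four-node monadic input the result is a single tree with eight nodes, since every descendant is processed once for each of its ancestors.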



5 Regular Expressions as a Pattern Language for $\mathcal{DTL}$

We now define unary and binary patterns that closely resemble the patterns
issued by XSL. Denote by $L(r)$ the language defined by the regular
expression $r$. For a tree $t$ and $v, v' \in \mathrm{Occ}(t)$,
$\mathrm{path}(t, v, v')$ denotes the string formed by the node labels on the
unique path from $v$ to $v'$ (the labels of $v$ and $v'$ included).

Unary patterns are defined as $p(x) = u \cdot x \cdot d$, where $u$ and $d$
are regular expressions and $x$ is a variable ($u$ stands for up and $d$
stands for down). For a tree $s$ and a node $v$: $(s, v) \in p$ iff
$\mathrm{path}(s, \varepsilon, v) \in L(u)$ and there is a leaf $v'$ in $s/v$
such that $\mathrm{path}(s, v, v') \in L(d)$.

^4 This is just a technicality to ensure tree to tree translations.

Binary patterns are also defined by regular expressions. We have up/down
patterns and down patterns:

- An up/down pattern is of the form $p(x, y) = x \cdot u \cdot d \cdot y$,
  where $u$ and $d$ are regular expressions, and $x$ and $y$ are variables.
  For a tree $s$ and nodes $v$ and $v'$: $(s, v, v') \in p$ iff there exists
  a common ancestor $v''$ of $v$ and $v'$ such that
  $\mathrm{path}(s, v, v'') \in L(u)$ and $\mathrm{path}(s, v'', v') \in L(d)$.

- A down pattern is of the form $p(x, y) = x \cdot d \cdot y$, where $d$ is a
  regular expression, and $x$ and $y$ are variables. For a tree $s$ and nodes
  $v$ and $v'$: $(s, v, v') \in p$ iff $v'$ is a descendant of $v$ and
  $\mathrm{path}(s, v, v') \in L(d)$.
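The three kinds of patterns defined above can be sketched with Python's `re` module. The encoding is an assumption: a path is serialized as its labels, each followed by `;`, so that a regular expression over labels becomes an ordinary string regex. The helpers `sym`, `ANY`, `STAR`, `unary`, `down` and `updown` are hypothetical names, descendants are taken to be proper descendants, and a node counts as its own ancestor in `updown`; both conventions are choices of this sketch.

```python
import re

ANY = r"[^;]+;"              # one arbitrary label (the alphabet Sigma)
STAR = "(?:" + ANY + ")*"    # Sigma*


def sym(a):
    """Regex for the single label a."""
    return re.escape(a) + ";"


def enc(labels):
    return "".join(l + ";" for l in labels)


def occ(t):
    out = [()]
    for i, kid in enumerate(t[1], start=1):
        out.extend((i,) + u for u in occ(kid))
    return out


def subtree(t, u):
    for i in u:
        t = t[1][i - 1]
    return t


def down_path(s, a, d):
    """Labels on the path from node a down to d (a must be a prefix of d)."""
    return [subtree(s, d[:k])[0] for k in range(len(a), len(d) + 1)]


def is_prefix(a, d):
    return d[:len(a)] == a


def unary(u_re, d_re):
    def p(s, v):
        if not re.fullmatch(u_re, enc(down_path(s, (), v))):
            return False
        leaves = [w for w in occ(s) if is_prefix(v, w) and not subtree(s, w)[1]]
        return any(re.fullmatch(d_re, enc(down_path(s, v, w))) for w in leaves)
    return p


def down(d_re):
    def p(s, v, w):
        proper = is_prefix(v, w) and len(w) > len(v)
        return proper and bool(re.fullmatch(d_re, enc(down_path(s, v, w))))
    return p


def updown(u_re, d_re):
    def p(s, v, w):
        for k in range(min(len(v), len(w)) + 1):
            a = v[:k]                      # candidate common ancestor
            if not is_prefix(a, w):
                break
            up = list(reversed(down_path(s, a, v)))
            if re.fullmatch(u_re, enc(up)) and re.fullmatch(d_re, enc(down_path(s, a, w))):
                return True
        return False
    return p


def leaf(a):
    return (a, [])


# The document of Figure 1 with full element names.
DOC = ("product", [
    ("sales", [("domestic", [leaf("a")]), ("domestic", [leaf("b")]), ("foreign", [leaf("c")])]),
    ("sales", [("domestic", [leaf("d")]), ("foreign", [leaf("e")]), ("foreign", [leaf("f")])]),
])

FOREIGN = down(sym("product") + sym("sales") + STAR + sym("foreign"))
SELECTED = [w for w in occ(DOC) if FOREIGN(DOC, (), w)]
```

`SELECTED` lists the three foreign nodes in pre-order, which is the order in which the first table of Figure 3 is filled.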

Denote the instantiation of $\mathcal{DTL}$ with the above patterns by
$\mathcal{DTL}^{reg}$.

Example 6. The XSL program in Figure 2 can be described in
$\mathcal{DTL}^{reg}$ as follows. Let $Q = \{q_0\}$, let $\Sigma$ and $\Delta$
consist of all ASCII symbols, and let $R$ consist of the following template
rules. (In all the rules the construction function $f$ is represented
directly by the binary pattern $f(q_0)$.)

    ($q_0$, $\Sigma^*$ product · x · product $\Sigma^*$,
         <OUT>(<TABLE>(x · product sales $\Sigma^*$ foreign · y)
               <TABLE>(x · product sales domestic · y)))
    ($q_0$, $\Sigma^*$ foreign · x · foreign $\Sigma^*$, ‘(x · foreign $\Sigma$ · y)’)
    ($q_0$, $\Sigma^*$ domestic · x · domestic $\Sigma^*$, (x · domestic $\Sigma$ · y))

The priority of the rules does not matter in this case and is therefore omitted.
We also refrained from specifying the built-in XSL rules that just copy through
the string content.

Let us now take a look at an example of the computation of a
$\mathcal{DTL}^{reg}$ program.

Example 7. Consider the simple program $P$ with the single template rule
$(q_0, \Sigma^* \cdot x \cdot \Sigma^*, \sigma(f))$ with
$f(q_0) = x \cdot \Sigma\Sigma^* \cdot y$, where $\Sigma = \{\sigma\}$.
Intuitively, $P$ selects all proper descendants of the current node. Recall
that the nodes that match a pattern are selected in pre-order of the input
tree; this is true for all $\mathcal{DTL}$ programs. Let us now consider the
monadic input tree $s = \sigma(\sigma(\sigma(\sigma)))$. We start with
$q_0(\varepsilon)$ (recall that $\varepsilon$ denotes the root of $s$). Now we
apply the (only) $q_0$-rule that matches at the root of $s$. In its right-hand
side the pattern $f(q_0) = x \cdot \Sigma\Sigma^* \cdot y$ has to be replaced
by the sequence of (pre-order) nodes that match
$x \cdot \Sigma\Sigma^* \cdot y$, where $x = \varepsilon$ (here, this will be
all proper descendants). The computation proceeds as shown in Figure 5.
Similarly, the reader may imagine what the derivation of $P$ looks like for
non-monadic input trees.
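The size of the output in Example 7 can be counted with a short recurrence, sketched here under the reading that the rule selects exactly the proper descendants of the current node. A node of a monadic tree with k proper descendants produces one output node plus the outputs of all k descendants; the function name `output_size` is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def output_size(k: int) -> int:
    """Output nodes produced at a node with k proper descendants (monadic input)."""
    # One node for the rule's own output symbol, plus the output of every
    # proper descendant, which has j = 0, ..., k-1 proper descendants itself.
    return 1 + sum(output_size(j) for j in range(k))


# Sizes for monadic inputs with 1, 2, 3 and 4 nodes.
SIZES = [output_size(k) for k in range(4)]
```

The recurrence solves to a power of two in the number of proper descendants, so even this single-rule program already exhibits an exponential blow-up on monadic inputs.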

We now establish the complexity of some relevant decision problems for
$\mathcal{DTL}^{reg}$. In the following let $D$ be a generalized DTD. We show
that whether or not a $\mathcal{DTL}^{reg}$ program terminates cannot be
decided efficiently and provide an EXPTIME algorithm.

Theorem 8. Deciding whether or not a $\mathcal{DTL}^{reg}$ program terminates
on every tree in $L(D)$ is EXPTIME-complete.

Fig. 5. A computation of P for input tree $s = \sigma(\sigma(\sigma(\sigma)))$

Sketch of Proof: EXPTIME-hardness is shown by a reduction from the circularity
problem of attribute grammars, which is EXPTIME-complete [15]. An attribute
grammar $G$ consists essentially of an underlying context-free grammar $G_0$,
a set of attributes and a set of semantic rules that define the attribute
values. We define the generalized DTD $D$ such that it defines all abstract
derivation trees of $G_0$ (cf., e.g., Lemma 5.5 in [17]); every node in such a
tree is labeled by a production of $G_0$ (plus a ‘sibling’ number). The values
of the attributes of a node (as defined by the semantic rules) depend on
attributes of the parent and on attributes of its siblings. Attributes are,
hence, defined locally. We take the attributes as the set of states of $P$.
The program $P$ now simulates all runs through the dependency graph of $G$ on
an input tree $s$ as follows. (W.l.o.g. we can assume that all grammar symbols
have the same set of attributes.) At the root of $s$, $P$ selects every node
in every state. At a node $u$ the state $a$ selects all nodes that the
attribute occurrence $a$ at $u$ depends on. If $a$ is a synthesized attribute,
then this is determined by the semantic rules of $p$ (the label of $u$) and
otherwise ($a$ inherited) it is determined by the semantic rules of the parent
of $u$. Clearly, $P$ arrives at the same node twice in the same state (and
hence does not terminate) iff an attribute occurrence depends on itself, i.e.,
iff $G$ is circular.

We will reduce the termination problem to the emptiness problem of two-way
non-deterministic tree automata with string regular look-around with negation
(2NTAs with look-around, for short). Intuitively, such an automaton can walk
non-deterministically in two directions through the input tree and can check
at any node whether a unary $\mathcal{DTL}^{reg}$ pattern holds or not (here
is where the negation comes in). We omit the formal definition of the
automaton and the rather involved proof of the next lemma, which uses
techniques developed by Neven and Schwentick [19]:

Lemma 9. Emptiness of two-way non-deterministic tree automata with string
regular look-around with negation is EXPTIME-complete.

Let $P = (\Sigma, \Delta, Q, q_0, R, \prec)$ be a $\mathcal{DTL}^{reg}$
program and $D$ a generalized DTD. We say that a node $v$ of a tree $t$ is
$q$-reachable for a state $q$ if there exists a tree
$\xi \in \mathcal{T}_{\Delta \cup Q}(\mathrm{Occ}(t))$ such that
$q_0(\varepsilon) \Rightarrow_{P,t}^* \xi$ and $\xi/u = q(v)$ for some
$u \in \mathrm{Occ}(\xi)$. If $P$ does not terminate on all trees in $L(D)$,
then there exists a tree $t$ and $v \in \mathrm{Occ}(t)$ such that $v$ is
$q$-reachable and $P$ returns at node $v$ in state $q$. We call such nodes
attractors (as the program always returns to them). We will now construct a
two-way non-deterministic tree automaton $\mathcal{A}$ with string regular
look-around with negation over the alphabet $\Sigma \cup (\Sigma \times Q)$
that accepts a tree $t$ iff $t$ has only one node labeled with an element of
$\Sigma \times Q$ (say $t[u] = (\sigma, q)$) and $P$ does not terminate on
$t'$, where $t'$ is obtained from $t$ by changing the label of $u$ into
$\sigma$. The automaton works as follows:

1. $\mathcal{A}$ first checks whether the input tree belongs to $L(D)$ (this
   requires a number of states polynomial in the size of $D$);

2. $\mathcal{A}$ checks whether there is exactly one node $v$ with
   $t[v] \in \Sigma \times Q$; this only needs a constant number of states; in
   the rest of the computation $\mathcal{A}$ will check whether $v$ is an
   attractor;

3. $\mathcal{A}$ checks whether $v$ is $q$-reachable:

   (a) $\mathcal{A}$ starts at the root and memorizes state $q_0$;

   (b) suppose $\mathcal{A}$ arrives at a node $u$; if $u$ is labeled with an
       element of $\Sigma \times Q$ (say $(\sigma, q)$) and $\mathcal{A}$ has
       memorized $q$, then go to (4); otherwise $\mathcal{A}$ determines which
       rule $P$ should apply at $u$ (note that $\mathcal{A}$ can do this
       without leaving $u$: it can use its string regular look-around to match
       unary patterns and negations of unary patterns; $\mathcal{A}$ guesses a
       rule and checks whether its selection pattern matches at $u$ and
       whether all rules of higher priority do not match); when $\mathcal{A}$
       has determined which rule to apply, it also knows which binary patterns
       have to be used to select the nodes that have to be processed next;

   (c) $\mathcal{A}$ non-deterministically picks one of these binary patterns,
       memorizes the associated state, non-deterministically runs to some node
       $u'$ and checks whether it satisfies the binary pattern (for both up
       and down patterns this can be done while walking to $u'$: no use of
       look-around is needed); go to (3b) with $u = u'$;

   This needs a number of states polynomial in the size of $P$.

4. Now $\mathcal{A}$ does the same starting from $v$ and accepts if it returns
   again at $v$ in state $q$. This needs a number of states polynomial in the
   size of $P$.

From Lemma 9 the result now follows. □

We next consider optimization of $\mathcal{DTL}^{reg}$ programs. A template
rule $r$ of a $\mathcal{DTL}^{reg}$ program $P$ is useful w.r.t. $D$ when
there exists a tree $t$ in $L(D)$ such that $P$ uses $r$ at some node of $t$.
A proof of the following proposition is similar to the proof of Theorem 8
(see also [9]).

Proposition 10. Given a $\mathcal{DTL}^{reg}$ program $P$ and a template rule
$r$ of $P$, deciding whether $r$ is useful w.r.t. $D$ is EXPTIME-complete.

A template rule $r$ of a $\mathcal{DTL}^{reg}$ program $P$ is utterly useless
when no node of a tree in $L(D)$ matches the selection pattern of $r$.

Two unary patterns $p = u \cdot x \cdot d$ and $p' = u' \cdot x \cdot d'$ are
equivalent w.r.t. $D$ if for any tree $t \in L(D)$ and any node $v$ of $t$:
$(t, v) \in p$ iff $(t, v) \in p'$. If $L(D) = \mathcal{T}_\Sigma$, then $p$
and $p'$ are equivalent iff $L(u) = L(u')$ and $L(d) = L(d')$. This problem is
known to be PSPACE-complete [22]. We show that it remains in PSPACE for
arbitrary generalized DTDs. Two binary patterns $p$ and $p'$ are equivalent
w.r.t. $D$ if for any tree $t \in L(D)$ and any nodes $v, v'$ of $t$:
$(t, v, v') \in p$ iff $(t, v, v') \in p'$. We obtain:

Proposition 11.
1. Given a template rule $r$ of $P$, it is decidable in PTIME whether or not
   $r$ is utterly useless w.r.t. $D$.

2. Deciding whether or not two unary $\mathcal{DTL}^{reg}$ patterns are
   equivalent w.r.t. a generalized DTD is in PSPACE.

3. Deciding whether or not two binary $\mathcal{DTL}^{reg}$ patterns are
   equivalent w.r.t. a generalized DTD is in PSPACE.

Sketch of Proof: (1) Let $D$ be a generalized DTD. A non-deterministic
finite-state automaton (NFA) $M_D$ with the following property can be
constructed (details omitted): $w \in L(M_D)$ if and only if there exists a
tree $t \in L(D)$ such that $w$ is a branch of $t$ (i.e., $w$ equals the
sequence of symbols on a path from the root to a leaf). Moreover, the size of
$M_D$ is polynomial in the size of $D$ and $M_D$ can be constructed in time
polynomial in the size of $D$. For an NFA $M$ define the language
$M^{\#} := \{w_1 a \# a w_2 \mid w_1 a w_2 \in L(M)\}$. For two languages $L$
and $L'$, we denote by $L \# L'$ the language
$\{w a \# a w' \mid wa \in L, aw' \in L'\}$. The pattern
$p = u \cdot x \cdot d$ is utterly useless iff
$M_D^{\#} \cap (L(u) \# L(d)) = \emptyset$. By standard techniques the latter
can be shown to be in PTIME.

(2) Let $p = u \cdot x \cdot d$ and $p' = u' \cdot x \cdot d'$ be two unary
patterns, and let $D$ be a generalized DTD. They are equivalent with respect
to $D$ iff for every $w_1 \# w_2 \in M_D^{\#}$,
$w_1 \# w_2 \in L(u) \# L(d) \Leftrightarrow w_1 \# w_2 \in L(u') \# L(d')$,
i.e., $(L(u) \# L(d)) \cap M_D^{\#} = (L(u') \# L(d')) \cap M_D^{\#}$. By
standard techniques the latter can be shown to be in PSPACE.

(3) Let $p(x, y) = x \cdot u \cdot d \cdot y$ and
$p'(x, y) = x \cdot u' \cdot d' \cdot y$ be two binary patterns, and let $D$
be a generalized DTD. For a tree $t$ and $v, v' \in \mathrm{Occ}(t)$ denote
the set of common ancestors of $v$ and $v'$ by $\mathrm{Anc}(v, v')$. Define
the string language $D_{hook}$ over $\Sigma \cup \{\#\}$ as
$\{\mathrm{path}(t, v, v'') \# \mathrm{path}(t, v'', v') \mid t \in L(D),\ v, v', v'' \in \mathrm{Occ}(t),\ v'' \in \mathrm{Anc}(v, v')\}$.
Clearly, $p$ and $p'$ are equivalent with respect to $D$ iff for every
$w_1 \# w_2 \in D_{hook}$,
$w_1 \# w_2 \in L(u) \# L(d) \Leftrightarrow w_1 \# w_2 \in L(u') \# L(d')$.

It can be shown that $D_{hook}$ is regular (details omitted). Moreover, the
size of the NFA that accepts $D_{hook}$ is polynomial in the size of $D$ and
the NFA can be constructed in time polynomial in the size of $D$.

Now, $p$ and $p'$ are equivalent with respect to $D$ iff
$(L(u) \# L(d)) \cap D_{hook} = (L(u') \# L(d')) \cap D_{hook}$. By standard
techniques the latter can be shown to be in PSPACE. □

6 MSO as a Pattern Language for $\mathcal{DTL}$

We now consider a much more powerful instantiation of $\mathcal{DTL}$, where
we use monadic second-order logic (MSO) formulas as a pattern language.

A tree $t \in \mathcal{T}_\Sigma$ can be viewed naturally as a finite
relational structure (in the sense of mathematical logic [7]) over the binary
relation symbols $\{E, <\}$ and the unary relation symbols
$\{O_\sigma \mid \sigma \in \Sigma\}$. The domain of $t$, viewed as a
structure, is the set of nodes of $t$. The edge relation $E$ in $t$ is the
set of pairs $(u, ui)$, where $u \in \mathrm{Occ}(t)$ and
$i \in [\mathrm{rank}(t, u)]$. The relation $<$ in $t$ is the set of pairs
$(ui, uj)$, where $u \in \mathrm{Occ}(t)$, $i, j \in [\mathrm{rank}(t, u)]$
and $i < j$. Finally, the set $O_\sigma$ in $t$ is the set of
$\sigma$-labeled nodes of $t$. MSO allows the use of set variables ranging
over sets of nodes of a tree, in addition to the individual variables ranging
over the nodes themselves as provided by first-order logic (see, e.g., [7]).
The satisfaction relation $\models$ is defined in the usual way.

Let $p = p(x)$ and $p' = p'(x, y)$, where $p$ and $p'$ are MSO formulas. Then
for a tree $s$ and nodes $u$ and $v$:

- $(s, u) \in p$ iff $s \models p[u]$, and

- $(s, u, v) \in p'$ iff $s \models p'[u, v]$.

Denote the instantiation of $\mathcal{DTL}$ with MSO patterns by
$\mathcal{DTL}^{mso}$. Clearly, any $\mathcal{DTL}^{reg}$ program can be
simulated by a $\mathcal{DTL}^{mso}$ program.

7 Top-Down Document Transformations 

In the remainder of this paper we study the natural fragment of
$\mathcal{DTL}^{mso}$ that induces top-down transformations. Programs can,
hence, no longer move up in the document and will always terminate.

Definition 12. A $\mathcal{DTL}^{mso}$ program
$P = (\Sigma, \Delta, Q, q_0, R, \prec)$ belongs to $\mathcal{DTL}^{mso}_d$ if
for every construction function $f$ in a template rule in $R$, every
$q \in Q$, and every input tree $s$ with nodes $u$ and $v$: if
$(s, u, v) \in f(q)$, then $u \neq v$ and $v$ is a descendant of $u$.

Note that $\mathcal{DTL}^{mso}_d$ is a decidable fragment of
$\mathcal{DTL}^{mso}$.

7.1 A Computational Model 

In this section we define top-down tree transducers with look-ahead which work 
over unranked trees (for short TOV^s). We first define the look-ahead which 
consists of forest regular languages as defined by Murata [18]: 

Definition 13. A non-deterministic bottom-up forest automaton (NBFA) is a
tuple $T = (Q, \Sigma, F, h)$, where $Q$ is a finite set of states, $\Sigma$
is an alphabet, $F \in \mathrm{Reg}(Q)$, and $h$ is a mapping
$\Sigma \times Q \to \mathrm{Reg}(Q)$. Define the semantics of $T$ on a
forest $t$ inductively as follows: for $\sigma \in \Sigma$ and $n \geq 0$,

- $T(\sigma(t_1 \cdots t_n)) = \{q \mid q_1 \cdots q_n \in h(\sigma, q) \text{ and } q_i \in T(t_i) \text{ for } i \in [n]\}$,

- $T(t_1 \cdots t_n) = \{q_1 \cdots q_n \mid q_i \in T(t_i) \text{ for } i \in [n]\}$.

A forest $t$ is accepted by $T$ if $T(t) \cap F \neq \emptyset$. A set of
forests is forest regular if it is accepted by an NBFA.
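The semantics of an NBFA can be evaluated directly from Definition 13, as in the following sketch. It assumes that states are single characters, that $h$ and $F$ are given as Python regular expressions over state characters, and that trees are pairs (label, list of children); the function names and the example automaton are hypothetical, and the enumeration of state words is exponential in the worst case.

```python
import re
from itertools import product


def tree_states(h, states, t):
    """T(sigma(t1...tn)): states q such that some word q1...qn with
    qi in T(ti) belongs to the regular language h(sigma, q)."""
    lab, kids = t
    kid_sets = [tree_states(h, states, k) for k in kids]
    words = {"".join(choice) for choice in product(*kid_sets)}
    return {q for q in states
            if (lab, q) in h and any(re.fullmatch(h[(lab, q)], w) for w in words)}


def accepts(h, states, final_re, forest):
    """A forest is accepted if some word q1...qn with qi in T(ti) lies in F."""
    sets = [tree_states(h, states, t) for t in forest]
    return any(re.fullmatch(final_re, "".join(choice)) for choice in product(*sets))


# Example automaton over the labels a and b (an illustration, not from the
# text): state y means "the tree contains a b", state n means "it does not".
STATES = "yn"
H = {
    ("a", "y"): r"[yn]*y[yn]*",   # some child subtree contains a b
    ("a", "n"): r"n*",            # no child subtree contains a b
    ("b", "y"): r"[yn]*",         # a b-labeled root always qualifies
}
FINAL = r"y+"                      # nonempty forest, every tree contains a b

A_B = ("a", [("b", [])])
A_A = ("a", [("a", [])])
B = ("b", [])
```

With these definitions the forest consisting of `A_B` and `B` is accepted, while replacing `A_B` by `A_A` leads to rejection because that tree can only reach state `n`.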

Note that the forest regular languages are also obtained by adapting Definition 1 
to forests, i.e., by simply dropping the condition on the start symbol. 

For a tree $t$, we denote with $\bar{t}$ the tree obtained from $t$ by
replacing the root symbol $\sigma = t[\varepsilon]$ by $\bar{\sigma}$. A
forest obtained from $t_1 \cdots t_n$ by changing a $t_j$ into $\bar{t}_j$
for one $j \in [n]$ is a pointed forest over $\Sigma$. A pointed forest
regular language over $\Sigma$ is a forest regular language of pointed
forests over $\Sigma$. For a forest $t$ and a node $vj$, we denote with
$\mathrm{point}(t, vj)$ the pointed forest
$s_1 \cdots s_{j-1} \bar{s}_j s_{j+1} \cdots s_n$, where $s_i = t/vi$ for
$i = 1, \dots, n$; by convention $\mathrm{point}(s, \varepsilon)$ denotes
$\bar{s}$.

We now define these transducers. The main difference to usual top-down tree
transducers (over ranked trees) is the following. In the ranked case the
right-hand side of a rule may contain (at leaves) recursive calls of the form
$q(x_i)$, denoting the transformation of the $i$-th subtree of the current
node in state $q$. But $i$ is unbounded in the case of unranked trees. Similar
to (unranked) tree automata we use, instead of $q(x_i)$, regular string
languages $L$ over states (plus the special symbol 0). Intuitively, a string
$w = q\,0\,q'$ in $L$ means to process the first subtree of the current node
in state $q$, omit the second, and process the third in state $q'$ (i.e., the
$i$-th position $w(i)$ corresponds to $x_i$).

Definition 14. An unranked top-down tree transducer $M$ with look-ahead is a
tuple $(Q, \Sigma, \Delta, q_0, R)$, where $Q$ is a finite set of states, $\Sigma$ and $\Delta$ are alphabets
of input and output symbols, respectively, $q_0 \in Q$ is the initial state, and $R$ is a
finite set of rules of the form $q(\sigma(\cdots)) \to \zeta \ \langle F \rangle$, where $\zeta$ is in $\mathcal{F}_\Delta(\mathrm{Reg}(Q \cup \{0\}))$
and $F$ is a pointed forest regular language over $\Sigma$; if $q = q_0$, then $\zeta$ is required
to be a tree and $\zeta[\varepsilon] \notin \mathrm{Reg}(Q \cup \{0\})$.

For a forest $\zeta \in \mathcal{F}_\Delta(\mathrm{Reg}(Q \cup \{0\}))$ we denote by $\mathrm{ROcc}(\zeta)$ the set of all leaves of
$\zeta$ which are labeled by elements in $\mathrm{Reg}(Q \cup \{0\})$, i.e., $\mathrm{ROcc}(\zeta) = \{\rho \in \mathrm{Occ}(\zeta) \mid
\zeta[\rho] \in \mathrm{Reg}(Q \cup \{0\})\}$.

A rule $r$ of the form $q(\sigma(\cdots)) \to \zeta \ \langle F \rangle$ is called a $(q, \sigma)$-rule, $\zeta$ is denoted by
$\mathrm{rhs}(r)$ and $F$ is denoted by $F_r$. For $q \in Q$, $\sigma \in \Sigma$, and $k \geq 0$ let $\mathrm{rins}_M(q, \sigma, k)$
denote the set of $k$-instances of $(q, \sigma)$-rules, that is, the set of all pairs $(r, \varphi)$,
where $r$ is a $(q, \sigma)$-rule in $R$ and $\varphi$ is a mapping which assigns to every $\rho \in
\mathrm{ROcc}(\mathrm{rhs}(r))$ a string of length $k$ in the regular language $\mathrm{rhs}(r)[\rho]$. If for every
$(r, \varphi_1), (r, \varphi_2) \in \mathrm{rins}_M(q, \sigma, k)$, $\varphi_1 = \varphi_2$, and for all distinct $(q, \sigma)$-rules $r$ and $r'$,
$F_r \cap F_{r'} = \emptyset$, then $M$ is deterministic. If not stated otherwise every transducer will
be deterministic.

We are now ready to define the derivation relation realized by $M$. Intuitively,
$M$ starts processing in its initial state $q_0$ at the root node $\varepsilon$ of the input
tree $s$. This is denoted by $q_0(\varepsilon)$. Now a $(q_0, \sigma)$-rule $r$ of $M$ can be applied, where
$\sigma = s[\varepsilon]$ and $\mathrm{point}(s, \varepsilon) = \hat{s} \in F_r$. This means to replace $q_0(\varepsilon)$ by the right-hand
side of $r$ and to replace in $r$ every regular language by the correct string of state
calls (by “state call” we mean trees of the form $q(i)$). The latter is done by
choosing with $r$ a mapping $\varphi$ such that $(r, \varphi) \in \mathrm{rins}_M(q_0, \sigma, k)$ (for deterministic
transducers there is at most one such $\varphi$ for each $r$), where $k$ is the rank of the root
of $s$.

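Recall that for an occurrence $\rho \in \mathrm{ROcc}(\mathrm{rhs}(r))$, $\varphi(\rho)$ is a string $w$ of length $k$
over $Q$ and 0, and that $w(i)$ denotes the $i$-th letter of $w$. For a string $w$, $w[s_1 \cdots s_k]$
denotes the forest $w(1)(s_1) \cdots w(k)(s_k)$, where $0(t) = \varepsilon$ for all trees $t$. Formally,
the derivation relation realized by $M$ on $s$, denoted by $\Rightarrow_{M,s}$, is the binary
relation on $\mathcal{T}_{Q \cup \Delta}(\mathrm{Occ}(s))$ defined as follows. For $\xi, \xi' \in \mathcal{T}_{Q \cup \Delta}(\mathrm{Occ}(s))$, $\xi \Rightarrow_{M,s} \xi'$
iff there is an occurrence $u \in \mathrm{Occ}(\xi)$ such that

1. $\xi/u = q(v)$ with $q \in Q$ and $v \in \mathrm{Occ}(s)$,

2. there is a $(r, \varphi) \in \mathrm{rins}_M(q, \sigma, k)$, where $k = \mathrm{rank}(s, v)$, such that $\mathrm{point}(s, v) \in F_r$,
$\xi' = \xi[u \leftarrow \zeta]$, and $\zeta = \mathrm{rhs}(r)[\rho \leftarrow \varphi(\rho)[v1 \cdots vk] \mid \rho \in \mathrm{ROcc}(\mathrm{rhs}(r))]$.

The transformation realized by $M$, denoted $\tau_M$, is $\{(s, t) \in \mathcal{T}_\Sigma \times \mathcal{T}_\Delta \mid q_0(\varepsilon) \Rightarrow^*_{M,s} t\}$.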
Example 15. Recall the program $P$ from Example 7. Now consider the
transducer $M$ consisting of the states $q_0$, $q$ and of the following two rules
(regular languages are denoted by regular expressions).

$q_0(\sigma(\cdots)) \to \sigma(q^*) \ \langle F \rangle$

$q(\sigma(\cdots)) \to \sigma(q^*)\, q^* \ \langle F \rangle$,

where $F$ is the set of all pointed forests. Again consider the input tree $s =
\sigma(\sigma(\sigma(\sigma)))$. In Figure 6 the corresponding derivation by $M$ is shown.

Fig. 6. Derivation of $M$ on the input tree $s = \sigma(\sigma(\sigma(\sigma)))$

Note that $M$ realizes the same transformation as $P$, even though the trees in the
derivation of M are different from those of P. While in the VTC'^^° program P 
one single pattern can select all descendants, the transducer M has to select 
these patterns while it moves top-down symbol by symbol through the input 
tree. 
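
The two rules of Example 15 can be simulated directly. The following Python sketch is an illustration only: trees are encoded as `(label, [children])`, the label `"s"` stands for $\sigma$, look-ahead is ignored because $F$ contains all pointed forests, and the function names and the visit counter are hypothetical helpers, not part of the formal model.

```python
from collections import Counter

# A tree is (label, [children]); nodes are identified by their path from the root.


def run_q(tree, path, visits):
    """State q, rule q(sigma(...)) -> sigma(q*) q*: returns a forest."""
    label, children = tree
    visits[path] += 1
    # First occurrence of q*: process every child in state q, below sigma.
    below = [t for i, child in enumerate(children)
             for t in run_q(child, path + (i,), visits)]
    # Second occurrence of q*: process every child again, next to sigma.
    beside = [t for i, child in enumerate(children)
              for t in run_q(child, path + (i,), visits)]
    return [(label, below)] + beside


def run_q0(tree, visits):
    """Initial state q0, rule q0(sigma(...)) -> sigma(q*): returns a tree."""
    label, children = tree
    visits[()] += 1
    below = [t for i, child in enumerate(children)
             for t in run_q(child, (i,), visits)]
    return (label, below)


def size(tree):
    """Number of nodes of a tree."""
    return 1 + sum(size(child) for child in tree[1])


# The monadic input tree sigma(sigma(sigma(sigma))).
s = ("s", [("s", [("s", [("s", [])])])])
visits = Counter()
output = run_q0(s, visits)
```

On this input the output tree has 8 nodes and the leaf of `s` is processed 4 times, since every application of the second rule doubles the work done on the subtree below it.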

We obtain the following equivalence (proof omitted): 

Theorem 16. The transformations defined by programs are exactly
those computed by deterministic unranked top-down tree transducers with look-ahead.

The above theorem states that these transducers can be used as a natural implementation
model for such programs. We next refine this result by showing that the look-ahead
can be dispensed with.

Theorem 17. Every T>PCff^° program can be implemented by a bottom-up re- 
labeling followed by a deterministic POV^ without look-ahead. 

Essentially, the look-ahead information needed at each node is encoded by a 
bottom-up relabeling into the label of each node (proof omitted). A bottom-up 
relabeling is obtained from an NBFA A by simply relabeling every node u by 
the state of A that processes u. 

7.2 Ranges 

The range of a VPC'^^° program P is defined as Tp(Ps) := {t\3s gPs ■ (s, t) G 
Tp} and can in general not be described by a generalized DTD. Consider, e.g., 

® Here, we denote regular languages by regular expressions. 
the program Pcopy which takes as input XML documents consisting of a

list list(aij • • • of items oi, . . . , only; -Pcopy should transform this list into 
two lists consisting of the serial numbers and prices of ai , . . . , Ofe , respectively. 
Thus, s = list(aij • • • should be transformed into t = out(list(&ij ' ' ’bi^) 
list(pi,^ • • where the bi are the corresponding numbers, and pi the prices. 

The program Pcopy has the rules: ri = (go, root(a;), list(/i, / 2 )), where root(a;) 
denotes the MSO formula which is true iff x is the root node and fi maps 
to the MSO formula E{x,y) (and the other states to false) and /2 maps qp to 
E(x,y) (and the other states to false). Then for q^ and qp there are rules X 2 = 
{q#,Oai{x),bi) and = (gp, Oa, (a^),Pi) for i G [fc]. Now, s is transformed 
by Pcopy into t. Clearly, the range Tp^^^^lTs) of Pcopy cannot be generated by 
a generalized DTD, because the string of leaf labels from left to right is a non 
context-free string language (for pi = bi it is {ww \ w G {6i, . . . , 6„}*}). 

We show, however, that the output of a VTCfl^° program can always be 
restricted to a (generalized) DTD. Similar results are known for ranked top- 
down tree transducers [24,8]. 

It follows from Theorem 16 and results of Fiilop [14], that it is even imdecid- 
able whether the output schema of a T>TC^^° (or even a program can 

be described by a (generalized) DTD. We now exhibit some relevant optimiza- 
tion problems that are decidable: it is decidable whether or not the range of a 
program is empty or finite. 

Theorem 18. The class of output languages of T>TCff^° programs is (1) closed 
under intersection with generalized DTD’s and (2) has a decidable emptiness 
and finiteness problem. 

Sketch of Proof: The proof of (1) is left out; it is based on closure of ex- 
tended DTDs under intersection and the fact that inverses of macro tree trans- 
ducers (working on binary encodings of unranked trees) preserve regular tree 
languages (Theorem 7.4(1) of [13]). (2) It is well known that unranked forests 
can be coded by binary trees (see, e.g., [20]). Figure 7 shows the forest $t =
\sigma(a_1 a_2(b_1 b_2) a_3)\,\delta(a_4)$ and its binary encoding $\mathrm{enc}(t)$. Here, the edges are labeled
for clarity: y-edges indicate the edges between nodes, while <-edges indicate the
ordering of siblings. Intuitively, for every unranked forest $s$ the first child of a
node $u$ in its encoding $\mathrm{enc}(s)$ is the first child of the corresponding node in $s$
(viz. a y-edge), and the second child of $u$ in $\mathrm{enc}(s)$ is the right sibling of the
corresponding node in $s$ (viz. a <-edge). For technical reasons we only use binary
labels in the encodings, plus the constant symbol nil. We will now show how to
simulate such a transducer $M$ by a tree transducer $N$ which works on the binary
encodings, i.e., $\mathrm{enc}^{-1} \circ N \circ \mathrm{enc} = \tau_M$.
Clearly, the range of $M$ is finite (empty) iff the output language of $N$ is finite
(empty). Let us now discuss the tree transducer $N$. A (usual) top-down tree
transducer over ranked trees is not sufficient, because in order to simulate a rule
with right-hand side of the form $L_1 \cdots L_n$ with $L_i \in \mathrm{Reg}(Q \cup \{0\})$, the binary
version must generate the translations of the $L_i$ “on top” of each other, i.e., the
corresponding root nodes are (rightmost) descendants of each other.
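
The first-child/right-sibling encoding described above can be written down as a short Python sketch. It is meant as an illustration only: trees are `(label, [children])` pairs, binary trees are `(label, first_child, right_sibling)` triples, the constant `NIL` plays the role of the symbol nil, and the node labels in the example are assumed readings of the forest of Figure 7.

```python
NIL = "nil"


def enc(forest):
    """Encode an unranked forest (a list of trees) as a binary tree."""
    if not forest:
        return NIL
    (label, children), *siblings = forest
    # First child of the binary node: encoding of the children of the root.
    # Second child of the binary node: encoding of the remaining siblings.
    return (label, enc(children), enc(siblings))


def dec(binary):
    """Decode a binary tree back into the unranked forest it represents."""
    if binary == NIL:
        return []
    label, first_child, right_sibling = binary
    return [(label, dec(first_child))] + dec(right_sibling)


# Assumed reading of the forest of Figure 7.
t = [
    ("sigma", [("a1", []), ("a2", [("b1", []), ("b2", [])]), ("a3", [])]),
    ("delta", [("a4", [])]),
]
```

Decoding inverts encoding, so `dec(enc(t)) == t`. In `enc(t)` the root nodes of the trees of a forest lie on the rightmost path, which is the "on top of each other" effect that the transducer $N$ has to reproduce.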

There is a tree transducer model which can simulate this kind of behaviour
by the additional use of parameters: the macro tree transducer (with look-ahead)
(MTT) [13,10]. The idea is that for every regular language $L$ in the right-hand
side of a rule of $M$ the MTT $N$ has a state $q_L$ which simulates this language.
It must do this on descendants of the form $u12^*$. Each state (besides $q_0$) is of
rank two which means that it has two parameters $y_1$ and $y_2$. An occurrence of
$\langle q_L, x_i \rangle(t_1, t_2)$ in the right-hand side of an MTT means that $q_L$ should process
the $i$th son of the current node with $t_1, t_2$ as $y_1, y_2$, respectively (cf. [10]). By
look-ahead (and the states of $N$) we can determine which state $q$ of $M$ should
translate the current node. Thus, the right-hand side of a $(q_L, \sigma)$-rule with look-ahead
$q$ (on the rightmost input subtree) is obtained from the encoding of the
right-hand side $\zeta$ of the $(q, \sigma)$-rule of $M$ by replacing each regular language $L'$
(which appears as $L'(\mathrm{nil}, \mathrm{nil})$ in $\mathrm{enc}(\zeta)$) by $\langle q_{L'}, x_1 \rangle$. Additionally, replace the
rightmost nil-labeled leaf by $\langle q_L, x_2 \rangle(\mathrm{nil}, y_2)$.

Fig. 7. Binary encoding and decoding of an unranked forest

If, e.g., $\zeta = L_1 \cdots L_n$, then we get for $N$ a right-hand side of the form

$\langle q_{L_1}, x_1 \rangle(\mathrm{nil}, \langle q_{L_2}, x_1 \rangle(\mathrm{nil}, \ldots \langle q_{L_n}, x_1 \rangle(\mathrm{nil}, \langle q_L, x_2 \rangle(\mathrm{nil}, y_2)) \ldots ))$.

Finally, since the first son of a state is always nil, we need rules $\langle q_L, \mathrm{nil} \rangle(y_1, y_2) \to
y_2$. The formal construction is straightforward. Then decidability of emptiness
and finiteness follows from Lemma 3.14 and Theorem 4.5 of [6], respectively.

□ 



7.3 Safe Transformations 

We now define a dynamic restriction on programs that allows copying
but nevertheless induces transformations of only linear size increase. We bound
the number of times that any node $u$ of an input tree $s$ may be selected during
the transformation of $s$. If $k$ is such a bound (copying bound), then clearly the
size of an output tree is bounded by $k \cdot p$ times the size of the input tree, where
$p$ is the size of the largest right-hand side. Hence, we get a linear size increase.
A program for which there exists such a bound $k$ is called safe. In
terms of tree transducers this is the natural notion of finite copying (cf. [12]).






Consider the program Pcopy discussed in Section 7.2. Clearly Pcopy is
safe: it has copying bound 2. The program P of Example 7 is not safe: The leaf
of the monadic input tree of height n is transformed 2"“^ times, i.e., there is no 
copying bound for P. 

We note now, without proof, that safeness of T>T£^^° programs is decidable. 
The proof uses Theorem 18(2) and is similar to the proof of Lemma 3 in [11]. 

Theorem 19. Safeness of programs is decidable.
Abstract. Document specification languages like XML model documents
using extended context-free grammars. These differ from standard
context-free grammars in that they allow arbitrary regular expressions 
on the right-hand side of productions. To query such documents, we in- 
troduce a new form of attribute grammars (extended AGs) that work 
directly over extended context-free grammars rather than over standard 
context-free grammars. Viewed as a query language, extended AGs are 
particularly relevant as they can take into account the inherent order of 
the children of a node in a document. We show that two key properties 
of standard attribute grammars carry over to extended AGs: efficiency of 
evaluation and decidability of well-definedness. We further characterize 
the expressiveness of extended AGs in terms of monadic second-order 
logic and establish the complexity of their non-emptiness and equiva- 
lence problem to be complete for EXPTIME. As an application we show 
that the Region Algebra expressions can be efficiently translated into 
extended AGs. This translation drastically improves the known upper 
bound on the complexity of the emptiness and equivalence test for Re- 
gion Algebra expressions. 



1 Introduction 

Structured document databases can be seen as derivation trees of some grammar 
which functions as the “schema” of the database [2,4,19,20,22,32,36]. Document 
specification languages like, e.g., XML [12], model documents using extended 
context-free grammars. Extended context-free grammars (ECFG) are context- 
free grammars (CFG) having regular expressions over grammar symbols on the 
right-hand side of productions. It is known that ECFGs generate the same class 
of string languages as CFGs. Hence, from a formal language point of view, 
ECFGs are nothing but shorthands for CFGs. However, when grammars are 
used to model documents, i.e., when also the derivation trees are taken into con- 
sideration, the difference between CFGs and ECFGs becomes apparent. Indeed, 
compare Figure 1 and Figure 2. They both model a list of poems, but the CFG 
needs the extra non-terminals PoemList, VerseList, WordList, and LetterList to 
allow for an arbitrary number of poems, verses, words, and letters. These non-terminals,
however, have no meaning at the level of the logical specification of
the document. 

A crucial difference between derivation trees of CFGs and derivation trees of
ECFGs is that the former are ranked while the latter are not. In other words,
nodes in a derivation tree of an ECFG need not have a fixed maximal number
of children. While ranked trees have been studied in depth [17,38], unranked
trees only recently received new attention in the context of SGML and XML.
Based on work of Pair and Quere [33] and Takahashi [37], Murata defined a
bottom-up automaton model for unranked trees [26]. This required describing
transition functions for an arbitrary number of children. Murata’s approach is
the following: a node is assigned a state by checking the sequence of states
assigned to its children for membership in a regular language. In this way, the
“infinite” transition function is represented in a finite way. We will extend this
idea to attribute grammars. Brüggemann-Klein, Murata and Wood initiated an
extensive study of tree automata over unranked trees [9].

The classical formalism of attribute grammars, introduced by Knuth [25],
has always been a prominent framework for expressing computations on derivation
trees. Therefore, in previous work, we investigated attribute grammars as a
query language for derivation trees of CFGs [28,31,32]. Attribute grammars provide
a mechanism for annotating the nodes of a tree with so-called “attributes”,
by means of so-called “semantic rules” which can work either bottom-up (for
so-called “synthesized” attribute values) or top-down (for so-called “inherited”
attribute values). Attribute grammars are applied in such diverse fields of computer
science as compiler construction and software engineering (for a survey,
see [14]).

Inspired by the idea of representing transition functions for automata on unranked
trees as regular string languages, we introduce extended attribute grammars
(extended AGs) that work directly over ECFGs rather than over standard
CFGs. The main difficulty in achieving this is that the right-hand sides of productions
contain regular expressions that, in general, specify infinite string languages.
This gives rise to two problems for the definition of extended AGs that
are not present for standard AGs:

(i) in a production, there may be an unbounded number of grammar symbols
for which attributes should be defined; and
(ii) the definition of an attribute should take into account that the number of
attributes it depends on may be unbounded.

We resolve these problems in the following way. For (i), we only consider unambiguous
regular expressions in the right-hand sides of productions.¹ This means
that every child of a node derived by the production $p = X \to r$ corresponds to
exactly one position in $r$. We then define attributes uniformly for every position
in $r$ and for the left-hand side of $p$. For (ii), we only allow a finite set $D$ as the
semantic domain of the attributes and we represent semantic rules as regular
languages over $D$ much in the same way tree automata over unranked trees are
defined.

¹ This is no loss of generality, as any regular language can be denoted by an
unambiguous regular expression [7]. SGML is even more restrictive as it allows only
one-unambiguous regular languages [8].

DB → PoemList
PoemList → Poem PoemList
PoemList → Poem
Poem → VerseList
VerseList → Verse VerseList
VerseList → Verse
Verse → WordList
WordList → Word WordList
WordList → Word
Word → LetterList
LetterList → Letter LetterList
LetterList → Letter
Letter → a | ... | z

Fig. 1. A CFG modeling a list of poems

DB → Poem⁺
Poem → Verse⁺
Verse → Word⁺
Word → (a + ··· + z)⁺

Fig. 2. An ECFG modeling a list of poems

By carefully tailoring the semantics of inherited attributes, extended AGs can 
take into account the inherent order of the children of a node in a document. 
This makes extended AGs particularly relevant as a query language. Indeed, as 
argued by Suciu [36], achieving this capability is one of the major challenges 
when applying the techniques developed for semi-structured data [1] to XML- 
documents. 

An important subclass of queries in the context of structured document
databases are the queries that select those subtrees in a document that satisfy
a certain pattern [3,23,24,27]. These are essentially unary queries: they map
a document to a set of its nodes. Extended AGs are especially tailored to express
such unary queries: the result of an extended AG consists of those nodes
for which the value of a designated attribute equals 1.²

² We always assume that $D$ contains the values 0 and 1 (false and true).

The contributions of this paper can be summarized as follows:

1. We introduce extended attribute grammars as a query language for structured
document databases defined by ECFGs. Queries in this query language
can be evaluated in time quadratic in the number of nodes of the
tree. We show that non-circularity, the property that an attribute grammar
is well-defined for every tree, is in EXPTIME. The latter is also a
lower bound since deciding non-circularity for standard attribute grammars
is already known to be hard for EXPTIME [25,21].

2. We generalize our earlier results on standard attribute grammars [5,32] by 
showing that extended AGs express precisely the unary queries definable 
in monadic second-order logic (MSO). 

3. We establish the EXPTIME-completeness of the non-emptiness (given an 
extended AG, does there exist a tree of which a node is selected by this 
extended AG?) and of the equivalence problem of extended AGs. 

4. We show that Region Algebra expressions (introduced by Consens and Milo
[11]) can be simulated by extended AGs. Stated as such, the result is not
surprising, since the former essentially corresponds to a fragment of first-order
logic over trees while the latter corresponds to full MSO. We, however,
exhibit an efficient translation, which gives rise to a drastic improvement
on the complexity of the equivalence problem of Region Algebra expressions.
To be precise, Consens and Milo first translate each Region Algebra
expression into an equivalent first-order logic formula on trees and then
invoke the known algorithm testing decidability of such formulas. Unfortunately,
the latter algorithm has non-elementary complexity. That is, the
complexity of this algorithm cannot be bounded by an elementary function
(i.e., an iterated exponential $2^{2^{\cdot^{\cdot^{2^n}}}}$ where $n$ is the size of the
input). This approach therefore conceals the real complexity of the equivalence
test of Region Algebra expressions. Our efficient translation of Region
Algebra expressions into extended AGs, however, gives an EXPTIME algorithm.
The thus obtained upper bound more closely matches the coNP
lower bound [11].

This paper is further organized as follows. In Section 2, we recall some basic 
definitions. In Section 3, we give an example introducing the important ideas for 
the definition of extended AGs introduced in Section 4. In Section 5, we obtain 
the exact complexity of the non-circularity test for extended AGs. In Section 6, 
we characterize the expressiveness of extended AGs in terms of monadic second- 
order logic. In Section 7, we establish the exact complexity of the emptiness 
and equivalence problem of extended AGs. We then use this result to improve 
the complexity of the emptiness and equivalence problem of Region Algebra 
expressions in Section 8. We present some concluding remarks in Section 9. 

Due to space limitations, most proofs are omitted. They can be found in the
author’s PhD thesis [29].

2 Preliminaries 

Let $\mathbb{N}$ denote the set of natural numbers. For a finite set $S$, we denote by $|S|$
the cardinality of $S$. For integers $i$ and $j$, we denote by $[i, j]$ the set $\{i, \ldots, j\}$.
In the following, $\Sigma$ is a finite alphabet. If $w = a_1 \cdots a_n$ is a string over $\Sigma$ then
we denote $a_i$ by $w(i)$ for $i \in \{1, \ldots, n\}$. We denote the length of $w$ by $|w|$.
For a regular expression $r$ over $\Sigma$, we denote by $L(r)$ the language defined
by $r$ and by $\mathrm{Sym}(r)$ the set of $\Sigma$-symbols occurring in $r$. The marking $\tilde{r}$ of $r$ is
obtained by subscripting in $r$ the first occurrence of a symbol of $\mathrm{Sym}(r)$ by 1, the
second by 2, and so on. For example, $a_1(a_2 + b_3^*)^* a_4$ is the marking of $a(a + b^*)^* a$.
We let $|r|$ denote the number of occurrences of $\Sigma$-symbols in $r$, while $r(i)$ denotes
the $\Sigma$-symbol at the $i$th occurrence in $r$. Let $\tilde{\Sigma}$ be the alphabet obtained from
$\Sigma$ by subscripting every symbol by all natural numbers, i.e., $\tilde{\Sigma} := \{a_i \mid a \in \Sigma, i \in
\mathbb{N}\}$. If $w \in \tilde{\Sigma}^*$ then $w^{\#}$ denotes the string obtained from $w$ by dropping the
subscripts.

In the definition of extended AGs we shall restrict ourselves to unambiguous 
regular expressions defined as follows: 

Definition 1. A regular expression $r$ over $\Sigma$ is unambiguous if for all $v, w \in
L(\tilde{r})$, $v^{\#} = w^{\#}$ implies $v = w$.

That is, a regular expression $r$ is unambiguous if every string in $L(r)$ can be
matched to $r$ in only one way. For example, the regular expression $(a + b)^*$ is
unambiguous while $(aa + a)^*$ is not. Indeed, it is easily checked that the string
$aa$ can be matched to $(aa + a)^*$ in two different ways.

The following proposition, obtained by Book et al. [7], says that the restric- 
tion to unambiguous regular expressions is no loss of generality. 

Proposition 2. For every regular language R there exists an unambiguous reg- 
ular expression r such that L{r) = R. 

If $w$ is a string and $r$ is an unambiguous regular expression with $w \in L(r)$,
then $w_r$ denotes the unique string over $\tilde{\Sigma}$ such that $w_r^{\#} = w$ and $w_r \in L(\tilde{r})$. For
$i = 1, \ldots, |w|$, define $\mathrm{pos}_r(i, w)$ as the subscript of the $i$th letter in $w_r$. Intuitively,
$\mathrm{pos}_r(i, w)$ indicates the position in $r$ matching the $i$th letter of $w$. For example,
if $r = a(b + a)^*$ and $w = abba$, then $\tilde{r} = a_1(b_2 + a_3)^*$ and $w_r = a_1 b_2 b_2 a_3$. Hence,

$\mathrm{pos}_r(1, w) = 1$, $\mathrm{pos}_r(2, w) = 2$, $\mathrm{pos}_r(3, w) = 2$, and $\mathrm{pos}_r(4, w) = 3$.
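
The position function of this example can be reproduced with a brute-force Python sketch. It is only an illustration under stated assumptions: the marked expression is written by hand as an ordinary regular expression over the digits 1 to 9 (so at most nine positions are supported), and the function name `pos` and its arguments are hypothetical.

```python
import re
from itertools import product


def pos(marked_regex, symbol_of, w):
    """Return the list of positions matched by the letters of w.

    marked_regex: the marking of r, written over position digits,
                  e.g. "1(2|3)*" for a1(b2 + a3)*.
    symbol_of:    maps each position digit to its unmarked symbol.
    Raises ValueError unless exactly one marked string drops to w.
    """
    # For every letter of w, the positions that carry this letter.
    candidates = [[p for p, a in symbol_of.items() if a == letter]
                  for letter in w]
    matches = ["".join(choice) for choice in product(*candidates)
               if re.fullmatch(marked_regex, "".join(choice))]
    if len(matches) != 1:
        raise ValueError(f"{len(matches)} ways to match {w!r}")
    return [int(p) for p in matches[0]]


# r = a(b + a)* with marking a1(b2 + a3)*
print(pos("1(2|3)*", {"1": "a", "2": "b", "3": "a"}, "abba"))  # [1, 2, 2, 3]
```

For the ambiguous expression $(aa + a)^*$, marked as `"(12|3)*"` with all three positions carrying `a`, the call on `"aa"` raises `ValueError` because both `12` and `33` match. The enumeration is exponential in the length of the string and is not how an actual evaluator would proceed.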

In the sequel, when we say regular expression, we always mean unambiguous 
regular expression. 

Extended AGs are defined over extended context-free grammars which are
defined as follows. An extended context-free grammar (ECFG) is a tuple $G =
(N, T, P, U)$, where $T$ and $N$ are disjoint finite non-empty sets, called the set of
terminals and non-terminals, respectively; $U \in N$ is the start symbol; and $P$ is
a set of productions consisting of rules of the form $X \to r$ where $X \in N$ and $r$ is
a regular expression over $N \cup T$ such that $\varepsilon \notin L(r)$ and $L(r) \neq \emptyset$. Additionally,
if $X \to r_1$ and $X \to r_2$ belong to $P$ then $L(r_1) \cap L(r_2) = \emptyset$.

A derivation tree $t$ over an ECFG $G$ is a tree labelled with symbols from
$N \cup T$ such that the root of $t$ is labelled with $U$; for every interior node $n$ with
children $n_1, \ldots, n_m$ there exists a production $X \to r$ such that $n$ is labelled
with $X$, for $i = 1, \ldots, m$, $n_i$ is labelled with $X_i$, and $X_1 \cdots X_m \in L(r)$; we say
that $n$ is derived by $X \to r$; and every leaf node is labelled with a terminal. We
denote by $\mathrm{root}(t)$ the root node of $t$.
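
The definition of a derivation tree translates into a simple recursive check. The Python sketch below is an illustration for the ECFG of Fig. 2; the encoding of productions as regular expressions over space-terminated labels, the tree representation `(label, [children])`, and all function names are assumptions made for the example.

```python
import re

# ECFG of Fig. 2. Each child label is followed by a space so that a sequence
# of labels can be matched by an ordinary regular expression.
PRODUCTIONS = {
    "DB": r"(Poem )+",
    "Poem": r"(Verse )+",
    "Verse": r"(Word )+",
    "Word": r"([a-z] )+",
}
TERMINALS = set("abcdefghijklmnopqrstuvwxyz")
START = "DB"


def derived_correctly(tree):
    """Check the conditions on interior nodes and leaves below `tree`."""
    label, children = tree
    if not children:
        return label in TERMINALS          # every leaf carries a terminal
    if label not in PRODUCTIONS:
        return False
    labels = "".join(child[0] + " " for child in children)
    return (re.fullmatch(PRODUCTIONS[label], labels) is not None
            and all(derived_correctly(child) for child in children))


def is_derivation_tree(tree):
    """The root must carry the start symbol and every node must be derived."""
    return tree[0] == START and derived_correctly(tree)


def word(letters):
    """Helper: the subtree for a single word."""
    return ("Word", [(letter, []) for letter in letters])
```

For example, `("DB", [("Poem", [("Verse", [word("king"), word("lord")])])])` is accepted, while a `DB` node with a `Verse` child is rejected. Note that the number of children is never bounded by the check, which is exactly the unranked behaviour discussed next.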
Note that derivation trees of ECFGs are unranked in the sense that the
number of children of a node need not be bounded by any constant and does 
not depend on the label of that node. 

Throughout the paper we make the harmless technical assumption that the 
start symbol does not occur on the right-hand side of a production. 

3 Example 

We give a small example introducing the important ideas for the definition of 
extended attribute grammars in the next section. 

First, we briefly illustrate the mechanism of attribute grammars by giving an
example of a Boolean valued standard attribute grammar (BAG). The latter are
studied by Neven and Van den Bussche [28,31,32]. As mentioned in the introduction,
attribute grammars provide a mechanism for annotating the nodes of a
tree with so-called “attributes”, by means of so-called “semantic rules”. A BAG
assigns Boolean values by means of propositional logic formulas to attributes of
nodes of input trees. Consider the CFG consisting of the productions $U \to AA$,
$A \to a$, and $A \to b$. The following BAG selects the first $A$ whenever the first $A$
is expanded to an $a$ and the second $A$ is expanded to a $b$:

$U \to AA$    $\mathit{select}(1) := \mathit{is\_a}(1) \wedge \neg \mathit{is\_a}(2)$;

$A \to a$    $\mathit{is\_a}(0) := \mathit{true}$

$A \to b$    $\mathit{is\_a}(0) := \mathit{false}$

Here, the 1 in $\mathit{select}(1)$ indicates that the attribute $\mathit{select}$ of the first $A$ is being
defined. Moreover, this attribute is true whenever the first $A$ is expanded to an $a$
(that is, $\mathit{is\_a}(1)$ should be true) and the second $A$ is expanded to a $b$ (that is,
$\mathit{is\_a}(2)$ should be false). The following rules then define the attribute $\mathit{is\_a}$ in the
obvious way. In the above, 0 refers to the left-hand side of the rule.

Consider the ECFG consisting of the sole rule $U \to (A + B)^*$. We now want
to construct an attribute grammar selecting those $A$’s that are preceded by an
even number of $A$’s and succeeded by an odd number of $B$’s. Like above we will
use rules defining the attribute $\mathit{select}$. This gives rise to two problems not present
for BAGs: (i) $U$ can have an unbounded number of children labelled with $A$
which implies that an unbounded number of attributes should be defined; (ii)
the definition of an attribute of an $A$ depends on its siblings, whose number is
again unbounded.

We resolve this in the following way. For (i), we just define $\mathit{select}$ uniformly
for each node that corresponds to the first position in the regular expression
$(A + B)^*$. For (ii), we use regular languages as semantic rules rather than propositional
formulas. The following extended AG now expresses the above query:

$U \to (A + B)^*$    $\mathit{select}(1) := \langle \sigma_1 = \mathit{lab}, \sigma_2 = \mathit{lab};$

$R_{\mathit{true}} = (B^*AB^*AB^*)^* \# A^*BA^*(A^*BA^*BA^*)^*,$

$R_{\mathit{false}} = (A + B + \#)^* - R_{\mathit{true}} \rangle.$
The 1 in $\mathit{select}(1)$ indicates that the attribute $\mathit{select}$ is defined uniformly for
every node corresponding to the first position in $(A + B)^*$. In the first part of
the semantic rule, each $\sigma_i$ lists the attributes of position $i$ that will be used. Here,
both for position 1 and 2 this is only the attribute $\mathit{lab}$ which is a special attribute
containing the label of the node. Consider the input tree $U(AAABBB)$. Then, to
check, for instance, whether the third $A$ is selected we enumerate the attributes
mentioned in the first part of the rule and insert the symbol # before the node
under consideration. This gives us the string

1 1  1 2 2 2   position in (A + B)*

A A #A B B B

1 2  3 4 5 6   position in AAABBB

The attribute $\mathit{select}$ of the third child will be assigned the value true since the
above string belongs to $R_{\mathit{true}}$. Note that

$(B^*AB^*AB^*)^*$ and $A^*BA^*(A^*BA^*BA^*)^*$

define the set of strings with an even number of $A$’s and with an odd number
of $B$’s, respectively. The above will be defined formally in the next section.
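
The evaluation of this semantic rule can be tried out with Python's `re` module. The sketch below is an illustration only: it takes the regular expression for the true case exactly as printed in the example, represents the children of $U$ by the string of their labels, and uses a hypothetical function name.

```python
import re

# The language for the value true, exactly as given in the example.
R_TRUE = re.compile(r"(B*AB*AB*)*#A*BA*(A*BA*BA*)*")


def selected_children(labels):
    """Return the 1-based indices of the children whose attribute select is true.

    `labels` is the string of labels of the children of U, e.g. "AAABBB".
    """
    selected = []
    for i, label in enumerate(labels):
        if label != "A":
            continue  # select(1) is only defined for position 1, i.e. the A's
        # List the lab attributes of all children and insert '#' before
        # the child under consideration.
        marked = labels[:i] + "#" + labels[i:]
        if R_TRUE.fullmatch(marked):
            selected.append(i + 1)
    return selected


print(selected_children("AAABBB"))  # [1, 3]
```

For the input tree $U(AAABBB)$ the first and the third $A$ are selected: they are preceded by zero and two $A$’s, respectively, and followed by three $B$’s. The second $A$ is preceded by a single $A$ and is therefore not selected.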

4 Attribute Grammars over Extended Context-Free 
Grammars 

In this section we define extended attribute grammars (extended AGs) over 
ECFGs whose attributes can take only values from a finite set D. Hence, we 
leave the framework of only Booleans as attribute values. Nevertheless, we still 
have the equivalence with MSO as is shown in the next section. 

Unless explicitly stated otherwise, we always assume an ECFG $G = (N, T, P,
U)$. When we say tree we always mean derivation tree of $G$.

Definition 3. An attribute grammar vocabulary is a tuple $(D, A, \mathrm{Syn}, \mathrm{Inh})$,
where

- $D$ is a finite set of values called the semantic domain. We assume that $D$
always contains the Boolean values 0 and 1;

- $A$ is a finite set of symbols called attributes; we always assume that $A$
contains the attribute $\mathit{lab}$;

- $\mathrm{Syn}$ and $\mathrm{Inh}$ are functions from $N \cup T$ to the powerset of $A - \{\mathit{lab}\}$ such
that for every $X \in N$, $\mathrm{Syn}(X) \cap \mathrm{Inh}(X) = \emptyset$; for every $X \in T$, $\mathrm{Syn}(X) = \emptyset$;
and $\mathrm{Inh}(U) = \emptyset$.

If $a \in \mathrm{Syn}(X)$, we say that $a$ is a synthesized attribute of $X$. If $a \in \mathrm{Inh}(X)$, we
say that $a$ is an inherited attribute of $X$. We also agree that $\mathit{lab}$ is an attribute
of every $X$ (this is a predefined attribute; for each node its value will be the
label of that node). The above conditions express that an attribute cannot be
a synthesized and an inherited attribute of the same grammar symbol, that
terminal symbols do not have synthesized attributes, and that the start symbol
does not have inherited attributes.

We now formally define the semantic rules of extended AGs. For a production
$p = X \to r$, define $p(0) = X$, and for $i \in [1, |r|]$, define $p(i) = r(i)$. We fix some
attribute grammar vocabulary $(D, A, \mathrm{Syn}, \mathrm{Inh})$ in the following definitions.

Definition 4. 1. Let $p = X \to r$ be a production of $G$ and let $a$ be an
attribute of $p(i)$ for some $i \in [0, |r|]$. The triple $(p, a, i)$ is called a context
if $a \in \mathrm{Syn}(p(i))$ implies $i = 0$, and $a \in \mathrm{Inh}(p(i))$ implies $i > 0$.

2. A rule in the context $(p, a, i)$ is an expression of the form

$a(i) := \langle \sigma_0, \ldots, \sigma_{|r|}; (R_d)_{d \in D} \rangle$,

where

- for $j \in [0, |r|]$, $\sigma_j$ is a sequence of attributes of $p(j)$;

- if $i = 0$ then, for each $d \in D$, $R_d$ is a regular language over the
alphabet $D$; and

- if $i > 0$ then, for each $d \in D$, $R_d$ is a regular language over the
alphabet $D \cup \{\#\}$.

For all $d, d' \in D$, if $d \neq d'$ then $R_d \cap R_{d'} = \emptyset$. Further, if $i = 0$ then
$\bigcup_{d \in D} R_d = D^*$. If $i > 0$ then $\bigcup_{d \in D} R_d$ should contain all strings over $D$
with exactly one occurrence of the symbol #. Note that an $R_d$ is allowed
to contain strings with several occurrences of #. We always assume that $\# \notin D$.

An extended AG is then defined as follows: 

Definition 5. An extended attribute grammar (extended AG) $\mathcal{F}$ consists of an
attribute grammar vocabulary, together with a mapping assigning to each context
a rule in that context.

It will always be understood which rule is associated to which context. We illustrate
the above definitions with an example.

Example 6. In Figure 3 an example of an extended AG $\mathcal{F}$ is depicted over
the ECFG of Figure 2. Recall that every grammar symbol has the attribute
$\mathit{lab}$; for each node this attribute has the label of that node as value. We have
$\mathrm{Syn}(\mathit{Word}) = \{\mathit{king}, \mathit{lord}\}$, $\mathrm{Syn}(\mathit{Verse}) = \{\mathit{king\_lord}\}$, $\mathrm{Syn}(\mathit{Poem}) = \{\mathit{result}\}$,
and $\mathrm{Inh}(\mathit{Poem}) = \{\mathit{first}\}$. The grammar symbols DB, a, ..., z, Verse, and Word
have no attributes apart from $\mathit{lab}$. The semantics of this extended AG will be
explained below. Here, $D = \{0, 1, a, \ldots, z, \mathit{DB}, \mathit{Poem}, \mathit{Verse}, \mathit{Word}\}$. We use regular
expressions to define the languages $R_1$; for the first rule, $R_0$ is defined as
$(D \cup \{\#\})^* - R_1$; for all other rules, $R_0$ is defined as $D^* - R_1$; those $R_d$ that are
not specified are empty; $\varepsilon$ stands for the empty sequence of attributes. □

The semantics of an extended AG is that it defines attributes of the nodes
of derivation trees of the underlying grammar G. This is formalized next.

DB → Poem⁺              first(1) := ⟨σ_0 = lab, σ_1 = lab; R_1 = DB#Poem*⟩

Poem → Verse⁺           result(0) := ⟨σ_0 = first, σ_1 = king_lord;
                                      R_1 = 1(1 + 0)* + 0(1(1 + 0))*(1 + ε)⟩

Verse → Word⁺           king_lord(0) := ⟨σ_0 = ε, σ_1 = (king, lord);
                                         R_1 = (0 + 1)*1(0 + 1)*⟩

Word → (a + ... + z)⁺   king(0) := ⟨σ_0 = ε, σ_1 = lab, ..., σ_26 = lab; R_1 = {king}⟩
                        lord(0) := ⟨σ_0 = ε, σ_1 = lab, ..., σ_26 = lab; R_1 = {lord}⟩

Fig. 3. Example of an extended AG



Definition 7. If t is a derivation tree of G then a valuation v of t is a function
that maps each pair (n, a), where n is a node in t and a is an attribute of the
label of n, to an element of D, and that maps, for every n, the pair (n, lab) to
the label of n.

In the sequel, for a pair (n, a) as above we will use the more intuitive notation
a(n). To define the semantics of F we first need the following definition. If
σ = a_1 ⋯ a_k is a sequence of attributes and n is a node of t, then define σ(n)
as the sequence of attribute-node pairs σ(n) = a_1(n) ⋯ a_k(n).

Definition 8. Let t be a derivation tree, n a node of t, and a an attribute of
the label of n.

Synthesized: Let n_1, ..., n_m be the children of n derived by p = X → r, and let
⟨σ_0, ..., σ_{|r|}; (R_d)_{d∈D}⟩ be the rule associated to the context (p, a, 0). Define
for l ∈ [1, m], j_l = pos_r(l, w), where w is the string formed by the labels of
the children of n. Then define W(a(n)) as the sequence

σ_0(n) · σ_{j_1}(n_1) ⋯ σ_{j_m}(n_m).

For each d, we denote the language R_d associated to a(n) by R_d^{a(n)}.

Inherited: Let n_1, ..., n_{k−1} be the left siblings, n_{k+1}, ..., n_m be the right
siblings, and n_0 be the parent of n. Let n_0 be derived by p = X → r, and
define for l ∈ [1, m], j_l = pos_r(l, w), where w is the string formed by the
labels of the children of n_0. Let ⟨σ_0, ..., σ_{|r|}; (R_d)_{d∈D}⟩ be the rule associated
to the context (p, a, j_k). Now define W(a(n)) as the sequence

σ_0(n_0) · σ_{j_1}(n_1) ⋯ σ_{j_{k−1}}(n_{k−1}) · # · σ_{j_k}(n) ⋯ σ_{j_m}(n_m).

For each d, we denote the language R_d associated to a(n) by R_d^{a(n)}.

If v is a valuation then define v(W(a(n))) as the string obtained from W(a(n))
by replacing each b(m) in W(a(n)) by v(b(m)). Note that the empty sequence
is just replaced by the empty string.

We are now ready to define the semantics of an extended AG F on a derivation
tree.

Definition 9. Given an extended AG F and a derivation tree t, we define a
sequence of partial valuations (F_j(t))_{j≥0} as follows:

1. F_0(t) is the valuation that maps, for every node n, lab(n) to the label of
n and is undefined everywhere else;

2. for j > 0, if F_{j−1}(t) is defined on all b(m) occurring in W(a(n)) then
F_j(t)(a(n)) = d where F_{j−1}(t)(W(a(n))) ∈ R_d^{a(n)}. Note that this is well
defined.

If for every t there is an l such that F_l(t) is totally defined (this implies that
F_l(t) = F_{l+1}(t)) then we say that F is non-circular. Obviously, non-circularity is
an important property. In the next section we show that it is decidable whether
an extended AG is non-circular. Therefore, in the sequel, we always assume an
extended AG to be non-circular.

Definition 10. The valuation F(t) equals F_l(t) with l such that F_l(t) =
F_{l+1}(t).
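
The iteration behind Definitions 9 and 10 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the paper: attribute instances a(n) are represented by arbitrary hashable keys, `deps` lists the instances occurring in the dependency sequence of each instance, and `classify` is a hypothetical stand-in for the languages R_d (it maps the string of input values to the unique d whose language contains it).

```python
def evaluate(deps, classify, initial):
    """Iterate the partial valuations until a fixpoint is reached.

    deps[x]     -- instances occurring in the dependency sequence of x, in order
    classify[x] -- maps the string of their values to the value d of x
    initial     -- the first valuation: values of the lab attributes only
    Returns the total valuation, or None if some instance never gets a value
    (which is what happens when the dependencies are circular).
    """
    valuation = dict(initial)
    while True:
        # Build the next valuation from the previous one: an instance gets a
        # value only once all of its inputs were defined in the previous step.
        step = dict(valuation)
        for x, inputs in deps.items():
            if x not in valuation and all(y in valuation for y in inputs):
                step[x] = classify[x]("".join(valuation[y] for y in inputs))
        if step == valuation:
            break
        valuation = step
    if all(x in valuation for x in deps):
        return valuation
    return None


# Hypothetical instances: b depends on lab, c depends on b and lab.
deps = {"b": ["lab"], "c": ["b", "lab"]}
classify = {
    "b": lambda s: "1" if s == "x" else "0",
    "c": lambda s: "1" if s == "1x" else "0",
}
total = evaluate(deps, classify, {"lab": "x"})

# Two instances that depend on each other never become defined.
circular = evaluate({"p": ["q"], "q": ["p"]}, {}, {})
```

In the first call every instance receives a value after two rounds; in the second call the loop stops immediately and the function reports that no total valuation exists.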

We will use the following definition of a query: 

Definition 11. A query is a function mapping each derivation tree to a set of 
its nodes. 

An extended AG F can be used in a simple way to express queries. Among the
attributes in the vocabulary of F, we designate some attribute result, and define:

Definition 12. An extended AG F expresses the query Q defined by

Q(t) = {n | F(t)(result(n)) = 1},

for every tree t.

Example 13. Recall the extended AG F of Figure 3. This extended AG selects
the first poem and every poem that has the strings king or lord in every other
verse starting from the first one. In Figure 4 an illustration is given of the result
of F on a derivation tree t. At each node n, we show the values F(t)(W(a(n))) and
F(t)(a(n)). We abbreviate a(n) by a, king by k, lord by l, and king_lord by k_l.

The definition of the inherited attribute first indicates how the use of #
can distinguish in a uniform way between different occurrences of the grammar
symbol Poem. This is only a simple example. In the next section we show that
extended AGs can express all queries definable in MSO. Hence, they can also
specify all relationships between siblings definable in MSO.

The language R_1 associated to result (cf. Figure 3) contains those strings
representing that the current Poem is the first one, or representing that for every
other verse starting at the first one the value of the attribute king_lord is 1. □

Fig. 4. A derivation tree and its valuation as defined by the extended AG in
Figure 3
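
To make the example concrete, the following Python sketch evaluates the poem query directly. It is not the paper's evaluation procedure: the regular languages are written as Python regexes under the reading that the language for first is DB#Poem*, the one for king_lord is "contains at least one 1", and the one for result is 1(1+0)* + 0(1(1+0))*(1+ε). The encoding of DB and Poem as the letters D and P, and the nested-list representation of the database, are assumptions made for this sketch.

```python
import re

# Languages R_1, written as regexes; R_0 is always the complement.
R1_FIRST = re.compile(r"D#P*")                  # DB # Poem*
R1_RESULT = re.compile(r"1[01]*|0(1[01])*1?")   # first poem, or every other verse
R1_KING_LORD = re.compile(r"[01]*1[01]*")       # some word is king or lord
R1_KING = re.compile(r"king")
R1_LORD = re.compile(r"lord")


def in_lang(regex, string):
    """Return '1' if the whole string is in the language, else '0'."""
    return "1" if regex.fullmatch(string) else "0"


def select_poems(db):
    """db is a list of poems; a poem is a list of verses; a verse is a list of words.

    Returns the 0-based indices of the poems whose result attribute is 1.
    """
    selected = []
    for k, poem in enumerate(db):
        # Inherited attribute first: label of the parent, labels of the left
        # siblings, the marker #, then the labels of the poem and its right siblings.
        first = in_lang(R1_FIRST, "D" + "P" * k + "#" + "P" * (len(db) - k))
        king_lords = []
        for verse in poem:
            # Synthesized attribute: king and lord values of every word, in order.
            bits = "".join(in_lang(R1_KING, w) + in_lang(R1_LORD, w) for w in verse)
            king_lords.append(in_lang(R1_KING_LORD, bits))
        if in_lang(R1_RESULT, first + "".join(king_lords)) == "1":
            selected.append(k)
    return selected


db = [
    [["the", "sun"], ["rises"]],                       # first poem
    [["a", "king"], ["sleeps"], ["my", "lord"]],       # verses 1 and 3 qualify
    [["a", "king"], ["sleeps"], ["quiet", "night"]],   # verse 3 does not qualify
]
selected = select_poems(db)
```

On this small database the first two poems are selected: the first because it is the first poem, the second because its first and third verses contain king or lord.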



5 Deciding Non-circularity Is in EXPTIME 

In this section we show that it is decidable whether an extended AG is non- 
circular. In particular, we show that deciding non-circularity is in EXPTIME. 
As it is well known that deciding non-circularity of standard AGs is complete for 
EXPTIME [21], going from ranked to unranked does not increase the complexity 
of the non-circularity problem. 

A naive approach to testing non-circularity is to transform an extended AG
F into a standard AG F' such that F is non-circular if and only if F' is
non-circular and then use the known exponential algorithm on F'. Indeed, we can
always find an integer N (polynomially depending on F) such that we only have
to test non-circularity of F on trees of rank N. Unfortunately, this approach
exponentially increases the size of the AG. Indeed, a production X → (a +
b) ⋯ (a + b) (n times), for example, has to be translated to the set of productions
{X → w | w ∈ {a, b}* ∧ |w| = n}. So, the complexity of the above algorithm
is double exponential time. Therefore, we abandon this approach and give a
different algorithm whose complexity is in EXPTIME.
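
The blow-up of the naive translation is easy to observe. The short Python sketch below (the function name and the pair representation of productions are illustrative choices, not taken from the paper) enumerates the ranked productions that a single production with n factors (a + b) expands to.

```python
from itertools import product


def expand_production(lhs, n, alphabet=("a", "b")):
    """List the ranked productions lhs -> w for every word w of length n."""
    return [(lhs, "".join(w)) for w in product(alphabet, repeat=n)]


# One extended production of size linear in n yields 2**n ranked productions.
sizes = {n: len(expand_production("X", n)) for n in (1, 2, 3, 10)}
```

Here `sizes` maps 10 to 1024, which illustrates why running an exponential algorithm on the translated grammar gives a doubly exponential bound.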

To this end, we generalize the tree walking automata of Bloem and Engelfriet
[6] to unranked trees. We then show that for each extended AG F, there
exists a tree walking automaton W_F such that F is non-circular if and only if
W_F does not cycle. Moreover, the size of W_F is polynomial in the size of F. We
then obtain our result by showing that testing whether a tree walking automaton
cycles is in EXPTIME. We omit the details.

Theorem 14. Deciding non-circularity of extended AGs is EXPTIME-complete.



6 Expressiveness of Extended AGs 

In this section we characterize the expressiveness of extended AGs as the queries 
definable in monadic second-order logic. 

A derivation tree t can be viewed naturally as a finite relational structure
(in the sense of mathematical logic [16]) over the binary relation symbols {E, <}
and the unary relation symbols {O_a | a ∈ N ∪ T}. The domain of t, viewed as
a structure, equals the set of nodes of t. The relation E in t equals the set of
pairs (n, n') such that n' is a child of n in t. The relation < in t equals the set
of pairs (n, n') such that n' ≠ n, n' and n are children of the same parent and
n' is a child occurring after n. The set O_a in t equals the set of a-labeled nodes
of t. Monadic second-order logic (MSO) allows the use of set variables ranging
over sets of nodes of a tree, in addition to the individual variables ranging over
the nodes themselves as provided by first-order logic (see, e.g., [16]). MSO can
be used in the standard way to define queries. If φ(x) is an MSO-formula, then
φ defines the query Q defined by Q(t) := {n | t ⊨ φ[n]}.
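
As an illustration of this view, the Python sketch below turns a tree into the relations E, < and O_a. The representation is an assumption of the sketch: trees are nested (label, children) pairs and nodes are named by their path from the root. The last lines evaluate one simple first-order query by hand (a Poem node with no earlier sibling); it is not a general MSO evaluator.

```python
def tree_to_structure(tree):
    """tree = (label, [children]). Returns (domain, E, LT, O).

    Nodes are identified by their path from the root (a tuple of child indices).
    """
    domain, E, LT, O = [], set(), set(), {}

    def visit(node, path):
        label, children = node
        domain.append(path)
        O.setdefault(label, set()).add(path)
        child_paths = [path + (i,) for i in range(len(children))]
        for i, cp in enumerate(child_paths):
            E.add((path, cp))
            # n < n' holds for every later sibling n', not only the next one.
            for later in child_paths[i + 1:]:
                LT.add((cp, later))
            visit(children[i], cp)

    visit(tree, ())
    return domain, E, LT, O


t = ("DB", [("Poem", []), ("Poem", []), ("Poem", [])])
domain, E, LT, O = tree_to_structure(t)

# Query: Poem nodes x such that no node y satisfies y < x.
first_poems = {n for n in O["Poem"] if not any((m, n) in LT for m in domain)}
```

For the three-poem tree above, `first_poems` contains only the path of the leftmost Poem node.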

We obtain: 

Theorem 15. A query is expressible by an extended AG if and only if it is 
definable in MSO. 

The proof of the above theorem is similar to the proof relating BAGs to MSO [32]. 
In MSO we can use set variables to represent assignment of values to attributes. 
For the other direction we make use of the ability of extended AGs to compute 
MSO-types of trees. The only complication arises from the fact that trees are 
unranked. See [29] for more details. 

7 Optimization 

We now obtain the exact complexity of some relevant optimization problems for
extended AGs. These results will be used in the next section to obtain a new
upper bound for deciding equivalence of Region Algebra expressions introduced
by Consens and Milo [11]. We represent the regular languages R_d in the semantic
rules by nondeterministic finite acceptors (NFAs). The size of an extended AG
is the size of the attribute grammar vocabulary plus the size of the NFAs in the
semantic rules. Consider the following optimization problems:

– NON-EMPTINESS: Given an extended AG F, does there exist a tree t
and a node n of t such that F(t)(result(n)) = 1?

– EQUIVALENCE: Given two extended AGs F_1 and F_2 over the same grammar,
do F_1 and F_2 express the same query?



We obtain:

Theorem 16. NON-EMPTINESS and EQUIVALENCE are EXPTIME-complete.

We outline the proof. Hardness is shown by a reduction from TWO PLAYER
CORRIDOR TILING [10]. To show membership, we transform a finite AG F
into a non-deterministic bottom-up automaton M_F (NBTA) over unranked trees
such that M_F accepts a tree if and only if F is non-empty. The size of M_F
is exponential in the size of F and the non-emptiness test of NBTAs can be
done in PTIME. Hence, testing non-emptiness of extended AGs can be done
in EXPTIME. Further, it can be shown that equivalence can be reduced to
non-emptiness in polynomial time.
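
The polynomial-time emptiness test mentioned in the outline is a reachability computation. The Python sketch below shows the idea for the simpler ranked case only, which is an assumption of this sketch: transitions are hypothetical triples (label, tuple of child states, resulting state), whereas the automata of the proof work over unranked trees, where the tuples of child states would be replaced by regular languages over states.

```python
def nonempty(transitions, final_states):
    """Decide whether a bottom-up tree automaton accepts at least one tree.

    A state is reachable if some transition produces it from child states
    that are all reachable already (leaf transitions have no child states).
    """
    reachable = set()
    changed = True
    while changed:
        changed = False
        for _label, child_states, state in transitions:
            if state not in reachable and all(q in reachable for q in child_states):
                reachable.add(state)
                changed = True
    return bool(reachable & set(final_states))


transitions = [("a", (), "q0"), ("f", ("q0", "q0"), "q1")]
accepts_something = nonempty(transitions, {"q1"})

# A state that can only be produced from itself is never reachable.
accepts_nothing = nonempty([("f", ("q1",), "q1")], {"q1"})
```

Each round of the loop adds at least one state or stops, so the number of rounds is bounded by the number of states.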

8 An Application of Extended AGs: Optimization of 
Region Algebra Expressions 

The region algebra introduced by Consens and Milo [11] is a set-at-a-time
algebra, based on the PAT algebra [35], for manipulating text regions. In this
section we show that any Region Algebra expression can be simulated by an 
extended AG of polynomial size. This then leads to an EXPTIME algorithm 
for the equivalence and emptiness test of Region Algebra expressions. The al- 
gorithm of Consens and Milo is based on the equivalence test for first-order 
logic formulas over trees which has a non-elementary lower bound. Our algo- 
rithm therefore drastically improves the complexity of the equivalence test for 
the Region Algebra and matches more closely the coNP lower bound [11]. 

It should be pointed out that our definition differs slightly from the one
in [11]. Indeed, we restrict ourselves to regular languages as patterns, while
Consens and Milo do not use a particular pattern language. This is no loss of
generality since

– on the one hand, regular languages are the most commonly used pattern
language in the context of document databases; and,

– on the other hand, the huge complexity of the algorithm of [11] is not due
to the pattern language at hand, but is due to quantifier alternation of the
resulting first-order logic formula, induced by combinations of the operators
− (difference) and <, >, ⊂, and ⊃.

A region index schema ℐ = (S_1, ..., S_n, Σ) consists of a set of region names
S_1, ..., S_n and a finite alphabet Σ. If N is a natural number, then a region over
N is a pair (i, j) with i ≤ j and i, j ∈ {1, ..., N}. An instance I of a region
index schema ℐ consists of a string I(ω) = a_1 ... a_{N_I} ∈ Σ* with N_I > 0, and a
mapping associating to each region name S a set of regions over N_I.

We abbreviate r ∈ ⋃_{i=1}^{n} I(S_i) by r ∈ I. We use the notation L(r) (respectively
R(r)) to denote the location of the left (respectively right) endpoint of a
region r and denote by ω(r) the string a_{L(r)} ... a_{R(r)}.

Example 17. Consider the region index schema ℐ = (Proc, Func, Var, Σ). In
Figure 5 an example of an instance over ℐ is depicted. Here, N_I = 16, I(ω) =
abcdefghijklmnop, I(Proc) = {(1, 16), (6, 10)}, I(Func) = {(12, 16)} and
I(Var) = {(2, 3), (6, 7), (12, 13)}. □

Fig. 5. An instance I over the region index schema of Example 17



For two regions r and s in I define:

– r < s if R(r) < L(s) (r precedes s); and

– r ⊂ s if L(s) < L(r) and R(r) ≤ R(s), or L(s) ≤ L(r) and R(r) < R(s) (r
is included in s).

We also allow the dual operators r > s and r ⊃ s which have the obvious
meaning. An instance I is hierarchical if

– I(S) ∩ I(S') = ∅ for all region names S and S' in ℐ, and

– for all r, s ∈ I, one of the following holds: r < s, s < r, r ⊂ s or s ⊂ r.

The last condition simply says that if two regions overlap then one is strictly
contained in the other. The instance in Figure 5 is hierarchical. As in [11], we
only consider hierarchical instances. We next define the Region Algebra.
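
The two relations and the hierarchy condition can be checked mechanically. The following Python sketch assumes that regions are (left, right) pairs, that an instance is a dictionary from region names to sets of regions, and that inclusion is strict with one strict and one non-strict inequality; these representation choices and function names are illustrative only. The data is the instance of Example 17.

```python
def precedes(r, s):
    """r < s: r ends before s starts."""
    return r[1] < s[0]


def included(r, s):
    """r is strictly included in s."""
    return (s[0] < r[0] and r[1] <= s[1]) or (s[0] <= r[0] and r[1] < s[1])


def is_hierarchical(instance):
    """instance maps region names to sets of (left, right) pairs."""
    names = list(instance)
    # No region may be shared by two different region names.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if instance[a] & instance[b]:
                return False
    regions = [r for name in names for r in instance[name]]
    # Any two distinct regions must be disjoint or strictly nested.
    for i, r in enumerate(regions):
        for s in regions[i + 1:]:
            if not (precedes(r, s) or precedes(s, r)
                    or included(r, s) or included(s, r)):
                return False
    return True


example_17 = {
    "Proc": {(1, 16), (6, 10)},
    "Func": {(12, 16)},
    "Var": {(2, 3), (6, 7), (12, 13)},
}
ok = is_hierarchical(example_17)
overlapping = is_hierarchical({"A": {(1, 5)}, "B": {(3, 8)}})
```

The instance of Example 17 passes the check, while the two partially overlapping regions in the second call do not.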

Definition 18. Region Algebra expressions over ℐ = (S_1, ..., S_n, Σ) are
inductively defined as follows:

– every region name of ℐ is a Region Algebra expression;

– if e_1 and e_2 are Region Algebra expressions then e_1 ∪ e_2, e_1 − e_2, e_1 ⊂
e_2, e_1 < e_2, e_1 ⊃ e_2, and e_1 > e_2 are also Region Algebra expressions;

– if e is a Region Algebra expression and R is a regular language then σ_R(e)
is a Region Algebra expression.



The semantics of a Region Algebra expression on an instance I is defined as
follows:

⟦S⟧^I := {r | r ∈ I(S)};
⟦σ_R(e)⟧^I := {r | r ∈ ⟦e⟧^I and ω(r) ∈ R};
⟦e_1 ∪ e_2⟧^I := ⟦e_1⟧^I ∪ ⟦e_2⟧^I;
⟦e_1 − e_2⟧^I := ⟦e_1⟧^I − ⟦e_2⟧^I;

and for ∗ ∈ {<, >, ⊂, ⊃}:

⟦e_1 ∗ e_2⟧^I := {r | r ∈ ⟦e_1⟧^I and ∃s ∈ ⟦e_2⟧^I such that r ∗ s}.

As an example consider the Region Algebra expression

Proc ⊃ σ_{Σ*startΣ*}(Proc)

defining all the Proc regions which contain a Proc region that contains the
string start.
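
A direct interpreter makes these semantics concrete. The Python sketch below uses a hypothetical encoding of expressions as nested tuples and of patterns as Python regexes; neither is prescribed by the paper. It is run on the instance of Example 17, whose text does not contain the word start, so the selection pattern looks for the substring gh instead.

```python
import re


def included(r, s):
    """r is strictly included in s."""
    return (s[0] < r[0] and r[1] <= s[1]) or (s[0] <= r[0] and r[1] < s[1])


def eval_region(expr, instance, text):
    """Evaluate a Region Algebra expression; positions are 1-based.

    expr is ("name", S), ("select", regex, e) or (op, e1, e2) with op in
    {"union", "minus", "<", ">", "subset", "superset"}.
    """
    kind = expr[0]
    if kind == "name":
        return set(instance[expr[1]])
    if kind == "select":
        regions = eval_region(expr[2], instance, text)
        return {r for r in regions if re.fullmatch(expr[1], text[r[0] - 1:r[1]])}
    left = eval_region(expr[1], instance, text)
    right = eval_region(expr[2], instance, text)
    if kind == "union":
        return left | right
    if kind == "minus":
        return left - right
    relation = {
        "<": lambda r, s: r[1] < s[0],
        ">": lambda r, s: s[1] < r[0],
        "subset": lambda r, s: included(r, s),
        "superset": lambda r, s: included(s, r),
    }[kind]
    # Keep the regions of e1 that are related to at least one region of e2.
    return {r for r in left if any(relation(r, s) for s in right)}


instance = {
    "Proc": {(1, 16), (6, 10)},
    "Func": {(12, 16)},
    "Var": {(2, 3), (6, 7), (12, 13)},
}
text = "abcdefghijklmnop"

procs_with_gh = ("select", ".*gh.*", ("name", "Proc"))
outer = eval_region(("superset", ("name", "Proc"), procs_with_gh), instance, text)
vars_before_func = eval_region(("<", ("name", "Var"), ("name", "Func")), instance, text)
```

Both Proc regions contain gh, but only (1, 16) strictly contains another such region, so `outer` holds that single region; `vars_before_func` holds the two Var regions that end before the Func region starts.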

An important observation is that for any region index schema ℐ = (S_1, ...,
S_n, Σ) there exists an ECFG G_ℐ such that any hierarchical instance of ℐ
‘corresponds’ to a derivation tree of G_ℐ. This ECFG is defined as follows: G_ℐ =
(N, T, P, U), with N = {S_1, ..., S_n}, T = Σ, and where P consists of the rules

p_0 := U → (S_1 + ... + S_n + Σ)⁺;
p_1 := S_1 → (S_1 + ... + S_n + Σ)⁺;
...
p_n := S_n → (S_1 + ... + S_n + Σ)⁺.

For example, the derivation tree t_I of G_ℐ representing the instance I of Figure 5
is depicted in Figure 6. Regions in I then correspond to nodes in t_I in the obvious
way. We denote the node in t_I that corresponds to the region r by n_r.

Fig. 6. The tree t_I corresponding to the instance I of Figure 5
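
One plausible way to build such a tree from a hierarchical instance is sketched below in Python. The construction is an assumption of this sketch rather than the paper's definition: regions become inner nodes labeled by their region name, characters not covered by a nested region become leaves, and the root is labeled U. Trees are nested (label, children) pairs.

```python
def instance_to_tree(instance, text):
    """Return the tree of a hierarchical instance as nested (label, children) pairs."""
    # Outer regions first: sort by left endpoint, then by decreasing right endpoint.
    regions = sorted(
        ((l, r, name) for name, rs in instance.items() for (l, r) in rs),
        key=lambda x: (x[0], -x[1]),
    )

    def build(label, left, right, idx):
        """Build the node spanning positions left..right; idx is the next unused region."""
        children, pos = [], left
        while pos <= right:
            if idx < len(regions) and regions[idx][0] == pos and regions[idx][1] <= right:
                l, r, name = regions[idx]
                child, idx = build(name, l, r, idx + 1)
                children.append(child)
                pos = r + 1
            else:
                # A character that is not covered by a nested region becomes a leaf.
                children.append((text[pos - 1], []))
                pos += 1
        return (label, children), idx

    tree, _ = build("U", 1, len(text), 0)
    return tree


instance = {
    "Proc": {(1, 16), (6, 10)},
    "Func": {(12, 16)},
    "Var": {(2, 3), (6, 7), (12, 13)},
}
tree = instance_to_tree(instance, "abcdefghijklmnop")
outer_proc = tree[1][0]
outer_labels = [child[0] for child in outer_proc[1]]
```

For the instance of Example 17 the root has a single Proc child, whose children are, in order, a, Var, d, e, Proc, k and Func.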



Since extended AGs can store results of subcomputations in their attributes, 
they are naturally closed under composition. It is, hence, no surprise that the 
translation of Region Algebra expressions into extended AGs proceeds by induc- 
tion on the structure of the former. 

Lemma 19. For every Region Algebra expression e over ℐ there exists an
extended AG F_e over G_ℐ such that for every hierarchical instance I and region
r ∈ I, r ∈ ⟦e⟧^I if and only if F_e(t_I)(result_e(n_r)) = 1. Moreover, F_e can be
constructed in time polynomial in the size of e.

We need the following notion to state the main result of this section. A Region
Algebra expression e over ℐ is empty if for every hierarchical instance I over ℐ,
⟦e⟧^I = ∅. Two Region Algebra expressions e_1 and e_2 over ℐ are equivalent if for
every hierarchical instance I over ℐ, ⟦e_1⟧^I = ⟦e_2⟧^I.

Theorem 20. Testing emptiness and equivalence of Region Algebra expressions
is in EXPTIME.

9 Discussion 

In other work [30], Schwentick and the present author defined query automata 
to query structured documents. Query automata are two-way automata over 
(un)ranked trees that can select nodes depending on the current state and on 
the label at these nodes. Query automata can express precisely the unary MSO 
definable queries and have an EXPTIME-complete equivalence problem. This 
makes them look rather similar to extended AGs. The two formalisms are, how- 
ever, very different in nature. Indeed, query automata constitute a procedural 
formalism that has only local memory (in the state of the automaton), but which 
can visit each node more than a constant number of times. Attribute grammars,
on the other hand, are a declarative formalism, whose evaluation visits each
node of the input tree only a constant number of times (once for each attribute).
In addition, they have a distributed memory (in the attributes at each node).
It is precisely this distributed memory which makes extended AGs particularly
well-suited for an efficient simulation of Region Algebra expressions. It is, hence,
not clear whether there exists an efficient translation from Region Algebra
expressions into query automata.

Extended AGs can only express queries that retrieve subtrees from a document.
It would be interesting to see whether the present formalism can be
extended to also take restructuring of documents into account. A related paper
in this respect is that of Crescenzi and Mecca [13]. They define an interesting
formalism for the definition of wrappers that map derivation trees of
regular grammars to relational databases. Their formalism, however, is only
defined for regular grammars and the correspondence between actions (i.e.,
semantic rules) and grammar symbols occurring in regular expressions is not
as flexible as for extended AGs. Other work that uses attribute grammars in
the context of databases includes work of Abiteboul, Cluet, and Milo [2] and
Kilpeläinen et al. [22].

Abstract. Construction of complex software systems with off-the-shelf
components has become a reality. Component-based frameworks tailored 
specifically for the domain of database integration are lacking, however. To 
use an existing component framework, data integrators must construct cus- 
tom components specialized to the tasks of the data integration problem at 
hand. This approach allows other components provided by the framework to 
be reused, but is overly tedious and requires the integrator to employ the pro- 
gramming paradigms assumed by the component framework for interconnec- 
tion and intercommunication between components, and manipulation of data 
provided by them. An alternate approach would employ a framework con- 
taining components tailored to data integration and which allows them to be 
interconnected using programming methods that are more natural to the do- 
main of data integration. Souk is a language-independent, component-based 
paradigm for data integration. It is designed to allow the rapid construction 
of data integration solutions from off-the-shelf components, and to allow 
flexible evolution. This paper gives an overview of this paradigm. 

1 Introduction 

This work addresses database, or data source, integration through a component-based
framework and programming paradigm designed to support the rapid construction of
data integration solutions in distributed object environments. Such solutions should
be easy to adapt as the underlying local data sources and client requirements
change.

The main goal of database integration is to create a global database which incorporates
and encapsulates data from a community of discrete, possibly heterogeneous, local
databases, and which can accept requests from clients to retrieve or update data
managed by the community of local databases without regard to their location in a 
network of data sources, representation, data model, access methods, or languages 
required to properly specify the requests of the individual local databases [4]. Thus, in 
meeting this goal, the global database might be required to accept a query expression 
from a client, manage its translation into several sub-queries in different query 
languages or method invocations, retrieve the results, and convert them into the 
representation and data model required by the client. Furthermore, the global database 
is most likely constrained to respect the separate and independent design, control, and 
administration of the local databases with which it interacts in order to service client 
requests. For example, the global database may be required to adapt to independently 
initiated schema design changes to a local database, or the global database may not be 
able to implement global transaction semantics by accessing information about the 
transaction schedules of local data sources. This separate and independent design, 
control, and administration is called local autonomy. We will refer to local databases 
from now on using the more general term, local data source, to indicate that data to be 
integrated may not be managed by a proper database management system, but may be 
some other type of system, such as a file system or an application. 

There are several classical approaches to data source integration. First, the 
schemas from all local data sources are integrated into a common data model to form a 
global schema, and then queries in a corresponding common query language are 
performed against this global schema [2]. This approach becomes intractable when the 
number of local data sources becomes large. Second, the intractability of global schema 
integration can be dealt with by performing integration on only those parts of local 
schemas that are exported by local data sources [16]. Third, schema integration can be
avoided altogether in favor of a system that accepts requests in the form of global
query language expressions or method invocations and translates them into the query
languages or method invocations of the underlying local data sources required to
satisfy the requests [7]. Finally, highly customized point solutions
can be constructed based on the particularities of the local data sources to be integrated. 
These are costly and often not easy to adapt to evolution of the local data sources. 

2 Our approach 

Our approach to data integration is orthogonal to those discussed above: we construct
data integration solutions using a component-based framework, together with a
programming or modeling paradigm that allows methods that are more natural to the
domain of data integration. Current component-based frameworks force the data
integrator to use tedious, imperative programming techniques and tedious procedures
for binding and communicating with other components or objects in the framework.
The complementary and problematic nature of data integrator needs and distributed
object environments is being recognized by industrial data integration providers such as
Cohera [10]. The collection of components in our framework allows the modularization
of functionality used in well-known approaches to data integration, such as view
construction, set operations, and data transformations. There is considerable overlap in
the issues that must be addressed in each of the existing data integration approaches,
including schema integration and translation, query decomposition and query language
translation, maintenance of consistency and dependency constraints across local data
sources, and global transaction processing. Data sources can be, and now are being,
wrapped or constructed using mature distributed object and component-based software
systems. These realities can be factored to produce a covering set of components, and a
paradigm for communication and coordination between them. The result can be a
covering set of components which allows the data integrator to rapidly construct
solutions taking one of the dominant approaches discussed above.

2.1 Component-based Software 

Szyperski defines software components as binary units of independent production,
acquisition, and deployment that interact to form a functioning system [27].
Independence allows for multiple independent developers, and the ability to integrate
at the binary level ensures robust integration. We define components as units of
specialized, prefabricated functionality whose instances can be combined
with other components to construct solutions to data integration problems involving
multiple, distributed data sources. Data sources may or may not be traditional database
management systems. Data sources in this context are assumed to be wrapped using a
technology such as CORBA [22].

We have identified an additional set of requirements that we think a component
language must meet to support data integration. A component language for data
integration must provide features or constructs that are: tailored to data definition,
manipulation, and integration; tailored to interface manipulation; amenable to
component evolution analysis; capable of expressing event subscription, event
notification and fault tolerance logic; capable of expressing global constraints; and
amenable to dynamic interface discovery, or “trading” in CORBA parlance [?]. Finally,
it must be amenable to highly scalable methods of composition, such as pattern-based
software construction. The work described in this paper seeks to address meta-language
issues around component interaction, data manipulation, event handling, and
pattern-based integration.

3 Perspectives on Modeling Data Integration Solutions 

The overriding requirement for our paradigm is that it should support programming 
approaches that are at or near the level of a data manipulation language. Existing 
component languages do not allow this. 

Carriero and Gelernter’s [8] taxonomy of conceptual classes of parallel programming
approaches provides a useful start here. We can envision the development of a data
integration solution in terms of the results of the data integration process, the agenda
of tasks that must be performed to achieve integration, or the ensemble of specialists
required to perform data integration. We believe that these three paradigms are relevant
classifications for the predominant data integration approaches described above. We do
not suggest that each data integration approach falls squarely within one of these
parallel programming approaches; rather, different aspects of each data integration
approach are similar enough to be useful to us here.

Global database query languages provide result-oriented solutions. They allow a
more “natural” means of expressing desired results than do imperative
programming languages. Schema integration and federation can be naturally articulated
as an agenda of tasks that must be executed to achieve the desired integration.
Customized data integration approaches and some aspects of the other data integration
approaches require ensembles of specialists for certain parts of the solution.

Our work has been influenced in various ways by a number of research efforts 
involving modular or component-based data integration. These include a la Carte [12], 
COMANDOS [3], Garlic [25], InterBase [6], MIND [11], Pegasus [26], and TSIMMIS 
[14]. These projects variously addressed aspects of componentization of multidatabase 
systems from transaction management to query processing, the use of standard 
distributed object technologies, and the use of mediators [30] as the unit of 
componentization. InterBase, Garlic, and Pegasus provide fairly complete 
decompositions of the overall problem space of distributed data integration. Our work 
is focused in particular on tailoring current component-based software construction 
approaches, embodied in technologies such as Enterprise Java Beans [28], to the 
domain of distributed data integration. Mediators are to provide the containers for our 
components. 

The paradigm we are developing is intended to support three programming or
modeling approaches: result-oriented solutions, agenda-based solutions,
and solutions based on the use of ensembles of specialists. Just as important, our
paradigm must also provide a bridge between existing component-based programming
paradigms and environments. The reality is that while technologies like
CORBA provide “glue” for components in such systems, once they are glued
together a programming paradigm that is not natural to database programming is
required to exchange and manipulate data between the components.
For example, the tasks of locating object references and performing object binding are
still relatively low-level, tedious and technology dependent. Data-oriented tasks, such
as the submitting of queries and the retrieval of result sets using technologies such as
JDBC [31], must also be implemented in a way that is low-level, brittle with respect to
the evolution of data integration solutions, and imperative, usually in great contrast to
the underlying query expressions being passed from client to server. Component-based
frameworks such as Enterprise Java Beans provide powerful mechanisms such as
reflection and contracts for addressing these problems, but components therein must
ultimately be manipulated in ways that are at odds (e.g. imperative and low-level) with
conceptual models of distributed data integration solutions.

Our hypothesis is that Petri net graphs can provide a useful modeling tool for data 
integration in this context. They provide an abstract and formal way of modeling: data 
flow and control problems seen in data integration, concurrent and asynchronous 
behavior often extant and useful in distributed and database systems, and event 
detection and temporality. Petri nets offer many analysis techniques applicable to data 
integration, including reachability, coverability, and boundedness [23]. Variants of the 




An Overview of Souk Nets 121 



general Petri net model have proven useful for the modeling of composite event 
detection and active database systems architectures [15, 19], and the specification and 
verification of distributed databases and software systems [5, 29]. The particular variant 
of Petri nets we use has the added benefit of being able to analyze and model dynamic 
changes to a network. We expect to apply these capabilities to model the handling of 
changes in data integration requirements and fault tolerance of a running data 
integration solution. 

4 The Souk Network Specification Model 

Our specification model for data integration is the Souk Net (SNet), a variant of 
WorkFlow Nets (WoFNets) designed to model data integration process flows. WoFNets 
are a class of Colored Petri nets [23]. This class of Petri nets has been used by Ellis and 
Keddara to model workflow processes and dynamic change in them [13]. An SNet is a 
bi-partite net with two kinds of nodes, places and transitions. They are colored in the 
sense that the tokens in an SNet can be assigned color types to indicate what type of 
information they carry: data type, control type (e.g. parameters), or combinations of 
both. The distribution of tokens in the SNet represent its state. Tokens may be 
distributed over places and transitions. 

Each transition in an SNet is associated with one of several types of data 
integration components. A component is executed when the associated transition //rej. 
The association between transitions and components is injective (i.e. each instance of a 
component type is uniquely associated with a transition). Each transition has at least one 
input place and at least one output place. A transition’s connectors are associated with 
the ports of its associated component through a total and injective labeling over port 
names. 

Each SNet has a single entry place and a single exit place. A net is connected, and 
each node in the net is reachable from the entry and may lead to the exit place. The firing 
of an SNet starts the moment a token is injected into its entry place, and completes when
a token is produced into its exit place. Below, we discuss the modeling elements of our
variant of WoFNets.

4.1 The Modeling Elements 

A Souk process (S-process) is a procedure where data and control objects are passed 
between Souk-components (S-components) according to a well-defined set of
semantics to achieve or contribute to an overall set of data integration goals. Each
S-process is defined in the context of a container. It will normally be the case that the
container for an S-process is a method in the object-oriented sense, and that collections 
of these methods will be implemented within a mediator or some other type of 
distributed object. Thus, each S-process constitutes all or part of an implementation of 
a method defined in the mediator’s interface. 

Each S-process defines a configuration of interconnected S-components and data 
places, and the flow of control and data among these network entities. An S-component 
(figure 1) may be one of several types, each type designed to perform specific 
operations on data object sets or control information. Operations on the elements of a 
data object set may be at either or both the semantic or structural levels. The flow of an
S-process is specified by interconnecting its components using connectors (i.e. edges).



[Figure 1: an S-component (a data or control operation) with input data ports, an output
data port, a precondition (an optional active guard), an internal data place, and external
data places (e.g. DBMS, file system).]

Fig. 1. The basic structure of an S-component.



Data places may be internal or external to an S-process. An internal data place is
declared, created, and managed as part of an S-process. Internal data places are
repositories that behave like Linda tuple spaces [8], except that they may only hold one
token and the data contained therein may be complex objects. An external data place is
a local data source whose existence and operation may be independent and autonomous
of the S-process. A local data source can be a proper database management system, a
file system, or an application program. Each local data source is encapsulated by a
server object. These server objects are assumed to be objects in some distributed
environment such as CORBA [22] or DCOM [18]. Each local data source involved in an
S-process is accessed through the interface of its associated server object. S-processes
manipulate sets of data which originate from and are ultimately stored in local data
sources.

S-processes make no assumptions about the structure of the data objects on which
they operate. To support data integration, data objects in our model must be able to
represent arbitrarily complex data, including 1NF tuples, structurally complex objects,
nested relations, text streams, and byte streams. We have, therefore, chosen as our
global data model the data type system set forth by CORBA IDL [22]. Thus, data
objects may be of any type expressible in IDL and data object sets are IDL sequences
having data objects as their elements.

An S-component is modeled using a black box approach with respect to the
structure of the data object sets it processes. Room prevents us from discussing the
intricacies of S-component firings here. It suffices to say that the general firing process
of an S-component has three phases:

Phase 1: the S-component consumes one token from each of its input places;

Phase 2: the S-component performs its specified operation based on the information
contained in the tokens it has consumed;

Phase 3: the S-component produces one new token for each of its output places.

S-components may be broadly classified as:

Filters: components that alter the membership of a data object set according to
criteria specified by the modeler of the S-process. Filters support the database notion of
selection. They do not alter the structure of their input data object sets. The filtering
criteria are referred to as the filtering guards and are specified by a logical expression
(e.g. in some predicate calculus or algebra). Each filter has one input data port and one
output data port.

Transforms: components that perform specific structural or semantic (e.g. data 
value) transformations on a single input data object set. Transforms support the 
database notions of view construction and schema modification. Each transform has 
one single input data port and one single output data port. 

Blenders: components that combine several data inputs into one single data 
output. The blending may be structural, elemental, or both. A structural blending is used 
to perform view construction over multiple data sources. Element-wise blendings are 
essentially set-based operations and support such operations as union, intersection, set 
difference, set division, Cartesian product, and join. 

Controllers: components that manage the routing of tokens in an S-process. 
Controllers support the global database notion of query decomposition, where a
query is split into several sub-queries, each sent to a specific local data source. A 
controller may also be used to combine or decompose parameter values coming from 
different components. 
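
As a rough illustration of the three firing phases applied to a filter, the following Java sketch may help. It is not part of the Souk Net definition: the class Place, the method fireFilter, and the rule that the output place must be empty before the component fires are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy model of an S-component firing; all names here are hypothetical. */
final class FilterFiringSketch {

    /** An internal data place: holds at most one token (a data object set). */
    static final class Place<T> {
        private List<T> token;                 // null means "no token present"

        boolean hasToken() { return token != null; }

        void produce(List<T> t) {
            if (token != null) throw new IllegalStateException("place already holds a token");
            token = t;
        }

        List<T> consume() {
            if (token == null) throw new IllegalStateException("place is empty");
            List<T> t = token;
            token = null;
            return t;
        }
    }

    /** A filter: one input port, one output port, and a filtering guard. */
    static <T> boolean fireFilter(Place<T> in, Place<T> out, Predicate<T> guard) {
        if (!in.hasToken() || out.hasToken()) return false;   // assumed enabling rule
        List<T> input = in.consume();                          // phase 1: consume
        List<T> kept = new ArrayList<>();                      // phase 2: operate
        for (T element : input) {
            if (guard.test(element)) kept.add(element);        // membership changes, structure does not
        }
        out.produce(kept);                                     // phase 3: produce
        return true;
    }

    public static void main(String[] args) {
        Place<Integer> in = new Place<>();
        Place<Integer> out = new Place<>();
        in.produce(Arrays.asList(3, 8, 12, 5));
        fireFilter(in, out, x -> x > 4);
        System.out.println(out.consume());
    }
}
```

Running main prints [8, 12, 5]: the guard removes one element and leaves the remaining elements unchanged.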

A data port provides a conduit for transferring tokens between an S-component and one 
of its data places. An S-component transfers data object sets between its data places and 
itself using the methods produce and consume, which are defined for all data ports
attached to an S-component. If the methods produce or consume are to be invoked on a
data port to or from an internal data place, a built-in implementation of these methods 
can be used. 

To achieve data object set transfers between an S-component and an external data 
place, however, the implementations of the produce and consume methods must be 
overridden because we cannot assume which methods in the interface of the server 
object that wraps that external data place must be invoked to transfer data to and from 
it. This must be done for each data port attached to an external data place. 

The usual Petri net semantics of token consumption will not usually apply for
external data places, as the corpus of data in a local data source is unlikely to be entirely
removed during a query. Rather, data contained in the token that represents the contents
of a local data source will either be copied, augmented or removed, but the token itself
will remain. These special semantics for external data places allow us to retain the usual
firing semantics for the interior of an S-process since an S-component connected to an
external data place needs only to receive a control token (e.g. a request involving the 
external data place) to be enabled. We simplify our graphical notation by using a double 
arrow for a data port if both updates and retrievals are possible from an external data 
place. 
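
The difference between built-in and overridden data port methods might look roughly like the Java sketch below. The classes DataPort, AddressServer, and AddressServerPort and their methods are hypothetical names chosen for illustration; a real server object would expose whatever interface its wrapper defines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of data ports; every type and method name here is hypothetical. */
final class DataPortSketch {

    /** Port to an internal data place, with built-in produce and consume. */
    static class DataPort {
        private List<String> token;

        void produce(List<String> dataObjectSet) { token = dataObjectSet; }

        List<String> consume() {
            List<String> t = token;
            token = null;              // usual Petri net semantics: the token is removed
            return t;
        }
    }

    /** Stand-in for a server object wrapping a local data source. */
    static final class AddressServer {
        private final List<String> rows = new ArrayList<>();
        void insertRow(String row) { rows.add(row); }
        List<String> selectAll() { return new ArrayList<>(rows); }
    }

    /** Port to an external data place: both methods are overridden for this server. */
    static final class AddressServerPort extends DataPort {
        private final AddressServer server;

        AddressServerPort(AddressServer server) { this.server = server; }

        @Override
        void produce(List<String> dataObjectSet) {
            for (String row : dataObjectSet) server.insertRow(row);   // augments the source
        }

        @Override
        List<String> consume() {
            return server.selectAll();  // data is copied; the source keeps its contents
        }
    }

    public static void main(String[] args) {
        DataPort port = new AddressServerPort(new AddressServer());
        port.produce(Arrays.asList("1 Main St", "2 High St"));
        System.out.println(port.consume());
        System.out.println(port.consume());   // still available: the external token remains
    }
}
```

An S-component written against DataPort need not know which kind of place it is attached to; only the port implementation changes.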

The server objects that wrap external data places provide two fundamental ways of
transferring data object sets across their interfaces: parameterized accessors and
iterative (cursor-based) accessors. Parameterized accessors provide for the retrieval of
data object sets via out or inout parameters. Iterative accessors provide for the retrieval
of data object sets by first invoking an initialize method, which returns a cursor, and then
repeatedly invoking an iterator method to extract each element of the data object set
resulting from the initialize method invocation. Thus, the data object sets transferred by
parameterized accessors can be scalar values or sequences, and those transferred by
iterative accessors are individual data objects which are aggregated into a sequence one
at a time.
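
The two accessor styles can be contrasted with a small Java sketch. The names CustomerServer, SequenceHolder, getNames, initialize, and drain are invented for this example and are not CORBA or DCOM APIs; a plain holder object stands in for an out parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch of the two accessor styles; the interfaces are hypothetical. */
final class AccessorSketch {

    /** Stands in for an out parameter: the callee fills in the value. */
    static final class SequenceHolder {
        List<String> value;
    }

    /** Hypothetical server object exposing both accessor styles. */
    static final class CustomerServer {
        private final List<String> names = Arrays.asList("Ada", "Grace", "Edsger");

        /** Parameterized accessor: the whole sequence is returned through the holder. */
        void getNames(SequenceHolder out) {
            out.value = new ArrayList<>(names);
        }

        /** Iterative accessor, step 1: initialize returns a cursor. */
        Iterator<String> initialize() {
            return names.iterator();
        }
    }

    /** Iterative accessor, step 2: aggregate elements into a sequence one at a time. */
    static List<String> drain(Iterator<String> cursor) {
        List<String> sequence = new ArrayList<>();
        while (cursor.hasNext()) {
            sequence.add(cursor.next());
        }
        return sequence;
    }

    public static void main(String[] args) {
        CustomerServer server = new CustomerServer();
        SequenceHolder holder = new SequenceHolder();
        server.getNames(holder);
        System.out.println(holder.value.equals(drain(server.initialize())));
    }
}
```

Either way the S-component ends up with one data object set; an overridden consume method would hide which style the server object happens to offer.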

We give an example in figure 2 of the handling of a global query, its
decomposition into subqueries for each of the mediator’s constituent local data sources,
and the return of the results. This S-process accepts as input a token from its entry place
in the form of a global query expression, query_g. S-component C1 is a transform created
such that it can parse and decompose query_g into query expressions in the dialects or
languages required by external data sources EP1 and EP2. The resulting expressions are
query_1 and query_2. Once the tokens containing query_1 and query_2 reach data places
P1 and P4 respectively, the controller components responsible for interacting with the
external data sources, C2 and C3, are enabled and fire. S-component C4, a blender, then
combines the results, result_1 and result_2, in some way meaningful to the data integrator
to produce result_g. If a global query involves only one external data place, component
C1 will generate a null query for the other external data place, a no-op that is ignored by
the controller that receives it.
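
The flow just described can be mimicked in a few lines of Java. This is only a sketch: the textual query format ("EP1:term;EP2:term"), the substring matching, and the choice of union as the blending operation are assumptions made for the example, not features of the model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy walk-through of the global query flow; the query format is invented. */
final class GlobalQuerySketch {

    static final String NO_OP = "";   // stands in for the null query

    /** Transform: split "EP1:term;EP2:term" into one sub-query per source. */
    static String[] decompose(String globalQuery) {
        String[] subQueries = { NO_OP, NO_OP };
        for (String part : globalQuery.split(";")) {
            String[] kv = part.split(":");
            if (kv.length != 2) continue;                  // skip malformed parts
            if (kv[0].equals("EP1")) subQueries[0] = kv[1];
            if (kv[0].equals("EP2")) subQueries[1] = kv[1];
        }
        return subQueries;
    }

    /** Controller: run a sub-query against one source, ignoring no-ops. */
    static List<String> query(List<String> source, String subQuery) {
        if (subQuery.equals(NO_OP)) return Collections.emptyList();
        List<String> result = new ArrayList<>();
        for (String row : source) {
            if (row.contains(subQuery)) result.add(row);
        }
        return result;
    }

    /** Blender: here simply the union of both results, without duplicates. */
    static List<String> blend(List<String> r1, List<String> r2) {
        List<String> result = new ArrayList<>(r1);
        for (String row : r2) {
            if (!result.contains(row)) result.add(row);
        }
        return result;
    }

    static List<String> run(String globalQuery, List<String> ep1, List<String> ep2) {
        String[] q = decompose(globalQuery);
        return blend(query(ep1, q[0]), query(ep2, q[1]));
    }

    public static void main(String[] args) {
        List<String> ep1 = Arrays.asList("Smith Glasgow", "Brown Perth");
        List<String> ep2 = Arrays.asList("Jones Boulder");
        System.out.println(run("EP1:Smith;EP2:Jones", ep1, ep2));
    }
}
```

A global query that names only one source leaves the other sub-query as the no-op, which the corresponding controller ignores.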

4.2 Event Subscription and Notification 

Each S-component may observe events and may perform event notifications [9, 24]. 
Rules are declared, in the form of an S-process, for execution when a component 
observes an event to which it has subscribed and that event indicates a specified 
condition. Events in our model support the notion of global database constraints and the 
specification of a course of action if constraints are violated. Rules may themselves 
subscribe to other events and perform event notifications to handle events that are 
triggered during the execution of the rules. An event subscription is represented as a 
precondition which adorns an S-component, depicted using a triangle. It should be 
noted that the use of a Petri net approach allows us to readily apply the efficient 
composite event detection approach developed by Gatziu and Dittrich [15]. 

Suppose that, in the example shown in figure 2, we are required to update an ADDRESS
attribute value in EP2 whenever an ADDRESS update is sent to EP1. This event
subscription and corresponding rule can be specified as shown in figure 3. The event
subscription is the expression written over the triangle. It specifies that whenever
component C1, in figure 2, generates a query_1 token that is an update, this rule should
be fired. The guard on component C5 of the rule tests if the update is to ADDRESS. If
it is, an ADDRESS update is generated for EP2 and sent to data place P4 to replace the
no-op token that would have been generated by C1. The S-process in figure 2 then
resumes with the new query token that has been placed in P4. If the guard is false, then
the rule ends.
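
The logic of this rule can be paraphrased in Java as follows. The SQL-like token format and the method names subscribed, guard, and tokenForP4 are assumptions introduced for illustration; the model itself does not prescribe how query tokens are encoded.

```java
/** Sketch of the ADDRESS rule; the token format and names are hypothetical. */
final class AddressRuleSketch {

    static final String NO_OP = "";

    /** Event subscription: the rule fires only for first-source query tokens that are updates. */
    static boolean subscribed(String query1) {
        return query1.startsWith("UPDATE ");
    }

    /** Guard on the rule's component: is the update directed at ADDRESS? */
    static boolean guard(String query1) {
        return query1.contains(" ADDRESS=");
    }

    /**
     * Returns the token that ends up in the place feeding EP2: a mirrored ADDRESS
     * update when the rule fires and its guard holds, otherwise the token that the
     * decomposing transform produced (for example a no-op).
     */
    static String tokenForP4(String query1, String query2FromTransform) {
        if (subscribed(query1) && guard(query1)) {
            return query1.replaceFirst("^UPDATE EP1 ", "UPDATE EP2 ");
        }
        return query2FromTransform;
    }

    public static void main(String[] args) {
        System.out.println(tokenForP4("UPDATE EP1 SET ADDRESS='1 Main St' WHERE ID=7", NO_OP));
    }
}
```

When the guard is false the original token is returned unchanged, which corresponds to the rule ending without effect.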

Modular design is further supported in Souk Networks through the use of
Souk-macros (S-macros). An S-macro is an S-component that references another S-process.
We call this type of S-process an S-subprocess. The S-subprocess is executed whenever
the macro starts. The S-subprocess may in turn contain S-macros and so on. Thus, arbitrary
process nestings can be achieved.

5 Conclusion 

We have presented a new component-based paradigm that is designed to support a more 
natural programming paradigm for data integration than provided by existing 
component-based frameworks and the use of distributed objects external to our 
framework. The programming paradigm is supported through the use of a special 
variant of Petri nets, called SoukNets. The SoukNet model contains special types of 
transitions called S-components which modularize common or custom data integration 
operations, such as query translation and decomposition, set operations, and view 
construction. Data to be integrated are transferred among components in our framework
as tokens. Control information such as query expressions and parameters is passed in
tokens as well. We support a global data model which can be used to represent data
coming from local data sources of virtually any complexity and structure. Local data
sources are assumed in our paradigm to be wrapped by distributed objects. These are
represented as special types of places called external data places. Transitions which
interact with external data places use overridden versions of the data transfer methods
produce and consume built in to the SoukNet model. The ability to override the normal
Petri net token consumption and production semantics to account for the specifics of
interfaces provided by external data places allows our model to be applied to the
integration of data and services provided by objects resident in distributed object
environments such as CORBA and DCOM.







SoukNet articulates a model and a modeling paradigm, not a specific database
programming language. It is a meta-language. We are in the process of developing a 
database programming and integration language called COIL which is an instance of the 
SoukNet meta-language [20]. 

SoukNets are based on WoFNets which are designed to be adaptive [13]. That is, 
they can be used to analyze and model dynamic changes to a Petri net. We expect to 
apply these capabilities to model the handling of changes in data integration 
requirements and fault tolerance of a running data integration solution. Finally, we are 
addressing the modeling in SoukNet of global transactions, environment-wide mediator 
coordination [1], and the specification of inter-mediator and inter-object behavioral 
constraints [17]. 

Abstract. The transient keyword of the Java™ programming language
was originally introduced to prevent specific class fields from being stored
by a persistence mechanism. In the context of orthogonal persistence, this
is a particularly useful feature, since it allows the developer to easily deal
with state that is external to the system. Such state is inherently transient
and should not be stored, but instead re-created when necessary.
Unfortunately, the Java Language Specification does not accurately define
the semantics and correct usage of the transient keyword. This has
left it open to misinterpretation by third parties and its current meaning
is tied to the popular Java Object Serialisation mechanism. In this paper
we explain why the currently widely-accepted use of the transient
keyword is not appropriate in the context of orthogonal persistence, we
present a more detailed definition for it, and we show how the handling
of transient fields can be efficiently implemented in an orthogonally
persistent system, while preserving the desired semantics.



1 Introduction 

When programming in PJama, an orthogonally persistent system for the Java™
programming language [5,3], it is often necessary to mark fields of objects as
transient so that the system does not write their contents to disk. This is crucial
when dealing with state external to the system, which inherently cannot be stored
on disk, but instead must be re-created when necessary (this is an important
requirement for all open persistent systems; see Section 3).

In the context of the Java language, the natural way to mark such fields as 
transient would have been to use the transient keyword. Unfortunately, the 
Java Language Specification [14] gives a very loose definition of the semantics 
and the correct usage of transient. This has allowed third parties to misinterpret
it and tie it to the popular and widely-used Java Object Serialisation
mechanism [32]. Since this mechanism is part of the Java platform, this use
of transient has become a de-facto standard and has been propagated to the 
platform’s standard libraries [33]. 

In this paper we argue that the current use of the transient keyword is
overloaded with two different meanings, one of which is not fully compatible
with persistence. This has forced both the PJama team and GemStone Inc. to
introduce in their persistent systems ad-hoc ways for ignoring some occurrences
of transient in the standard libraries of the JDK platform (see Sections 4.3
and 4.4 respectively).

We believe that the correct interpretation of the transient keyword is an 
issue that should be addressed by the designers of the Java language, since it 
can ultimately apply to the majority of systems that provide persistence for the 
Java language, not only the orthogonal ones. 



1.1 Paper Overview 

Section 2 gives a brief overview of the PJama system and its architecture.
Section 3 establishes the need for transient data in open orthogonally persistent
systems. Section 4 includes an overview of the current interpretations of the 
transient keyword, gives concrete examples from the standard classes of the 
Java platform that show that its use is overloaded, and presents the reasons 
why PJama and GemStone Inc. have to ignore some occurrences of transient 
for the purposes of their storage systems. Section 5 describes the way transient 
fields are handled in the PJama system. Finally, Section 6 concludes the paper. 

2 Overview of PJama 

Orthogonal Persistence [6] is a language-independent model of persistence
defined by the following three principles [19,4].

— Type Orthogonality. Persistence is available for all data, irrespective of 
type. 

— Persistence By Reachability¹. The lifetime of all objects is determined
by reachability from a designated set of root objects. 

— Persistence Independence. It is indistinguishable whether code is
operating on short-lived or long-lived data.

PJama [5,3,22] is a system that provides orthogonal persistence for the Java 
programming language [14,1]. It was developed collaboratively between the
Department of Computing Science at the University of Glasgow and the Research
Laboratories of Sun Microsystems. It conforms to the three principles of
orthogonal persistence since

— instances of any class can persist (this includes the classes themselves, as 
they are instances of the class Class, and their static fields), 

— all objects reachable from a set of declared roots become persistent, unless 
explicitly specified as transient (see Section 4.3), and 

— code that operates over transient data can also operate over persistent data,
with no changes to the original source or post-processing of the bytecodes 
being necessary. 

An additional requirement for the PJama system was to implement all the above 
without introducing any new constructs to or changing the syntax of the Java 
language. This way, third-party classes (even ones developed for the “vanilla” 
Java language) can persist² using PJama, without requiring any changes to the
original sources (i.e. .java files) or the compiled bytecodes (i.e. .class files)
[5,3]. We have achieved this by introducing a small number of PJama-specific
classes that encompass the (minimal) API needed by the programmers to access
the persistence facilities of PJama [35,20]. Some of these classes provide wrappers
to native calls inside our customised interpreter (see Section 2.1). This approach
also allows us to use the standard Java compiler (javac) unchanged in order to
compile PJama-specific classes.

Unfortunately, our claim of complete type orthogonality is not yet entirely 
true. Even though the majority of classes can persist unchanged, there is a small 
number of them that either require some changes in order to persist or cannot 
persist at all. Examples of classes that require changes are ones that use the 
transient keyword in a manner incompatible with orthogonal persistence (see 
Section 4.3 for a lengthier explanation) or depend on static initialisers to load 
dynamic libraries³. Examples of classes that cannot persist at all are those tied
closely to the implementation of the Virtual Machine (VM) (Thread, Exception, 
etc.). The issues of type orthogonality in the Java language are discussed in more 
detail by Jordan and Atkinson [18,19,4]. 

PJama has taken a much more aggressive approach than any other project
that provides persistence for the Java language that we are aware of (this
includes systems that conform to the ODMG standard [8], GemStone/J [13], etc.)
in order to achieve complete type orthogonality. However, we believe that the
issues concerning the transient keyword, raised in the context of PJama, and
hence the Java language, are significant for any other language that aspires
to persistence. Many systems adopt partial persistence, i.e. only some classes
can persist (e.g. Java Object Serialisation [32]). For these systems a correct
treatment of transient is essential, as it is the only way in which application
programmers can specify how references to non-persistable objects can be dealt
with. Our discussion, suggested semantics, and implementation are therefore of
relevance to the design of future persistent languages and to all forms of
persistence for the Java language.

² In this paper, by “a class can persist” we imply that its instances can also persist.

³ In PJama, the static initialiser of a class is invoked once, when the class is first
loaded, and not every time it is fetched from the store by a new invocation of the 
VM. If some operations need to take place when a class is fetched from the store (e.g. 
dynamic library loading), this is done by the Global Runtime Events, as described 
in the Orthogonal Persistence for the Java Platform Proposal [20]. 

2.1 The PJama Architecture

This section provides a brief overview of the low-level architecture of the PJama 
system. It is included here, as it will be referred to in Section 5. 

PJama achieves the three principles of orthogonal persistence, mentioned in 
the previous section, by requiring changes to the VM. It can be argued that this 
is the only way to make some classes persistent (e.g. Class, Thread), since their 
state cannot be accessed from inside a Java program, hence a solution written 
entirely in the Java language would be inappropriate. In fact, GemStone Inc. have
taken a similar approach for their GemStone/J product [13].

The current PJama implementation is based on Sun’s Classic JDK™
Platform. A high-level illustration of its architecture is given in Figure 1. The original
JDK platform comprises the VM and the Transient Heap⁴, where objects are
allocated and manipulated. PJama extends this by adding the Object Cache,
where persistent objects are cached and manipulated. Objects in the transient
heap and in the object cache appear the same to the VM; therefore the
combination of the transient heap and the object cache can be viewed as a single
Persistent Heap. In fact, these two components might be unified in future
implementations of PJama.

When an object in the transient heap becomes persistent (i.e. when there
is a reference to it from a persistent object in the object cache), it is moved
to the object cache and an image of it is written to disk, via the Disk Cache
of the persistent store (this operation is called Object Promotion and is
illustrated in Figure 1). When an object needs to be fetched from disk, the buffer
where it resides is copied to the disk cache and then the object is copied to the
object cache (this operation is called Object-Faulting and is also illustrated in
Figure 1). At this point, all the references in it are PIDs⁵. The VM will detect
when one of these is dereferenced and will fetch the appropriate object from disk.
More information on the memory management of PJama is given by Daynes and
Atkinson [9].

⁴ This is also known as the Garbage Collected Heap.
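
The idea of faulting objects in on first dereference can be illustrated with a toy Java model. It is not how the PJama VM is implemented: the maps standing in for the disk and the object cache, and the names Image, CachedObject, fault, and next, are assumptions made purely for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of object faulting; the store layout and names are hypothetical. */
final class ObjectFaultingSketch {

    /** Stored image of an object: a payload and the PID of the object it refers to. */
    static final class Image {
        final String payload;
        final Integer nextPid;          // null when there is no reference

        Image(String payload, Integer nextPid) {
            this.payload = payload;
            this.nextPid = nextPid;
        }
    }

    /** In-memory object whose reference starts out as a PID. */
    static final class CachedObject {
        final String payload;
        private final Integer nextPid;
        private CachedObject next;      // direct pointer, set on first dereference

        CachedObject(Image image) {
            payload = image.payload;
            nextPid = image.nextPid;
        }
    }

    final Map<Integer, Image> disk = new HashMap<>();
    final Map<Integer, CachedObject> objectCache = new HashMap<>();
    int faults = 0;

    /** Object fault: copy the object from "disk" into the object cache if absent. */
    CachedObject fault(int pid) {
        CachedObject cached = objectCache.get(pid);
        if (cached == null) {
            cached = new CachedObject(disk.get(pid));
            objectCache.put(pid, cached);
            faults++;
        }
        return cached;
    }

    /** Dereference: a reference that is still a PID triggers a fault. */
    CachedObject next(CachedObject o) {
        if (o.next == null && o.nextPid != null) {
            o.next = fault(o.nextPid);
        }
        return o.next;
    }

    public static void main(String[] args) {
        ObjectFaultingSketch store = new ObjectFaultingSketch();
        store.disk.put(1, new Image("head", 2));
        store.disk.put(2, new Image("tail", null));
        CachedObject head = store.fault(1);
        System.out.println(store.next(head).payload + " after " + store.faults + " faults");
    }
}
```

Only objects that are actually reached are copied into the cache, and a second dereference of the same reference causes no further fault.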

The above architecture and the tight coupling between the components allow
the performance of the PJama system to be very good, since all the
object-fetching and dirty object-tracking is done entirely inside the runtime system. In
fact, performance evaluation experiments by Ridgway et al. show PJama faster
than all the approaches, written entirely in the Java language, that map objects
to relational or object-oriented databases [28].

3 The Need for Transient Data in Orthogonally 
Persistent Systems 

Modern programming languages typically include a large number of standard 
facilities, such as commonly-used utilities and bindings to external resources. 
One of the trends in programming language design is to keep the core and 
syntax of the language minimal and bundle most of the required facilities as 
libraries. This allows the language to be left unchanged, when new facilities are 
added to it, and makes third-party library additions and experimentation easier. 
Examples of this trend are the C++ platform [15] and the still-expanding Java
platform [14,33]. 

In contrast, early orthogonally persistent languages, like PS-algol [2] and 
Napier88 [21], were designed as closed systems. This allowed their implementers 
to achieve complete type orthogonality, since the state of all the data that needed 
to persist was known to the system. This approach proved the feasibility of
orthogonal persistence. However, it was very limiting, as the addition of any 
new facilities to the system, which had to communicate with external resources 
and therefore could not be written in the language itself, had to be incorporated 
inside the core of the system and could not be bundled as external libraries. This 
was the case, for example, for the windowing system and sockets. This limitation 
has been one of the biggest criticisms of orthogonally persistent systems. 

The above has sparked an interest in orthogonally persistent systems that can 
handle external state through a well-defined interface. Such systems are called 
open. A severe complication arises however when an open persistent system 
attempts to make external state persistent. When it re-instates it, it will not 
know how to perform any required initialisation as it does not know the data 
contents, layout, and interpretation. The only way that this can be achieved in 
a generic manner is to delegate to a library, which manages the external state, 
the identification of which data should not be put into the store and how its 
re-initialisation should be handled. 

⁵ Persistent Identifiers, the equivalent of references in the persistent store that
uniquely identify persistent objects [26]. 

Fig. 2. Handling External State in a Persistent System



We will call data that the application or library chooses not to allow to
become persistent, transient. An open persistent system should have a
well-defined way of allowing users to mark data as transient and restore it to
well-defined values, when moving it to disk, so that the next access to it can be
detected and any necessary re-initialisation code run.

The handling of external state using transient data is illustrated in Figure 2. 
Object W in memory contains the state of the image window in a format that 
the persistent system can understand and manipulate. Any changes to W are 
propagated, via calls to the appropriate library, to object ES that is the window 
state inside the windowing toolkit and is external to the persistent system. The 
reference from W to ES has been marked transient, hence the dashed arrow 
(notice that it is the reference field in object W that needs to be marked transient, 
not object ES itself). When object W is stored on disk, that reference will be cut 
and set to a default value. This will be detected, when W is re-activated, and 
the re-initialisation of ES will be triggered with a call to the windowing library. 

An interesting question, which arises in the above example, is what happens
when object W is written to disk and then accessed shortly afterwards, within
the same activation of the system. Obviously, we do not want the reference to
ES to be cut, since this will force the image window to be re-created every time
its state is written to disk. We believe that, during the same activation of the
system, the reference to ES should be retained in memory, but cut every time it
is written to disk. On a new activation of the system, only the first access to W
will detect that the reference had been cut; this will force an equivalent of ES
to be re-created⁶ and a reference to it stored in W.
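
The behaviour argued for here can be modelled in a few lines of Java. The sketch below is an illustration only: the map used as a "disk", the classes WindowState and ExternalWindow, and the recreations counter are hypothetical, and the real system performs these steps inside the VM rather than in application code.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the W/ES example; the store and all names are hypothetical. */
final class TransientReferenceSketch {

    /** External state (ES): owned by the windowing library, never written to the store. */
    static final class ExternalWindow {
        final String title;
        ExternalWindow(String title) { this.title = title; }
    }

    /** W: persistent window state plus a reference to ES that is treated as transient. */
    static final class WindowState {
        final String title;
        ExternalWindow es;              // conceptually marked transient
        int recreations = 0;

        WindowState(String title) { this.title = title; }

        /** Every access goes through here so that a cut reference is detected lazily. */
        ExternalWindow external() {
            if (es == null) {                       // default value: the reference was cut
                es = new ExternalWindow(title);     // stands in for a windowing library call
                recreations++;
            }
            return es;
        }
    }

    /** A "disk" that keeps only the persistent part of each object. */
    static final Map<String, String> disk = new HashMap<>();

    /** Writing cuts the reference in the stored image only; the in-memory W keeps it. */
    static void write(String key, WindowState w) {
        disk.put(key, w.title);
    }

    /** A new activation rebuilds W from disk with the transient field at its default. */
    static WindowState readInNewActivation(String key) {
        return new WindowState(disk.get(key));
    }

    public static void main(String[] args) {
        WindowState w = new WindowState("image");
        ExternalWindow first = w.external();
        write("w", w);
        System.out.println(w.external() == first);          // same activation: retained
        WindowState reloaded = readInNewActivation("w");
        reloaded.external();
        reloaded.external();
        System.out.println(reloaded.recreations);           // re-created once, on first access
    }
}
```

Writing W does not disturb the running window, while a fresh activation re-creates an equivalent of ES exactly once.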

Past experience with using persistent systems has shown that there are three 
cases where programmers need to mark data as transient. 

— For Handling External State. As explained in the previous section, data 
that contains external state, which the system cannot handle, should not 
be written to the store but should be re-initialised when faulted-in. An
example of this is the state of GUI components, which typically can only be
interpreted by the windowing library. 

— As a Model of Resumable Programming. Apart from external state, 
there are other categories of data that need to be re-initialised when read by 
a new invocation of the VM. If the system sets them to a default value when 
it puts them to disk, the application can later detect that this has happened 
and re-initialise them. An example of this is a string containing the name of 
the machine on which the application is running, that has to be reset every 
time the application starts up. 

— For Efficiency. There are cases when it is more efficient to re-create data, 
rather than store it on disk. For example, application developers might 
choose to store rarely accessed data in a compressed format and re-create it 
upon use, in order to save space on disk. This technique is of course strictly 
an optimisation and it can be argued that it breaks the rules of type
orthogonality, as defined in Section 2. However, developers might choose to use it
to tune their applications. 

We can now give the following definition. 

Definition 1 Transient data, in an open persistent system, is data that is not 
written to the persistent store, keeps its value throughout the same activation of 
the system, and is reset to a well-known default value, upon its first use in a new 
activation of the system. 

4 Defining Transient Data in the Java Language 

The Java programming language has a standard keyword called transient that 
was introduced to be used in situations similar to the ones presented in Section 3. 
However, its semantics have not been accurately defined and its current use in 
the standard libraries of the JDK platform is not completely compatible with 
Definition 1. The issues raised by this are discussed in this section. 

⁶ Since the Java language is platform-independent, the next activation can be on a
different architecture/OS/windowing environment, thus an equivalent rather than
identical object is needed.

4.1 The transient Keyword in the Java Language

In the Java Language Specification (JLS) version 1.0 [14, page 147], the 
transient keyword is defined as follows. 

Definition 2 “Variables may be marked transient to indicate that they are
not part of the persistent state of an object. If an instance of the class Point:

    class Point {
        int x, y;
        transient float rho, theta;
    }

were saved to persistent storage by a system service, then only the fields x and
y would be saved. This specification does not yet specify details of such services;
we intend to provide them in a future version of this specification.”

Even though the above definition loosely describes that transient fields should 
not be made persistent, it does not accurately define the following. 

— What the persistent state of an object is. 

— Under what circumstances a field should not be placed in persistent storage. 

— What values the transient fields are set to when the object is re-activated 
from persistent storage. 

— Whether transient fields retain their value, when an object is evicted from 
memory to disk and then re-read, within the same invocation of the VM. 

It turns out that the above details are very important when interpreting transient 
fields in the context of an orthogonally persistent system. In fact, if they are 
misinterpreted, they can negatively affect the programming model and break, in 
many cases, transparency and orthogonality. This will become apparent in later 
sections of the paper. 

4.2 Java Object Serialization 

Java Object Serialization (JOS) [32] is a mechanism that provides facilities to 
translate an object graph to and from a byte-stream. It was originally developed 
as part of the Java Remote Method Invocation (RMI) [34] framework^, which 
allows object graphs to be moved between different hosts. However, since a byte- 
stream can be stored on disk and then retrieved at a later time, it is easy to see 
how JOS can also be used as a persistence mechanism. 

^ Ken Arnold of Sun Microsystems, personal communication. 

Since the 1.1 release of the Java platform, JOS has been part of the 
standard java.io package, accessed via the ObjectInputStream and the 
ObjectOutputStream classes [33]. Apart from RMI, it is also used extensively 
for persistence facilities in the JavaBeans™ [30] and JavaSpaces™ [31] frameworks. 
Since these frameworks are very high-profile and widely used, “JOS has 
effectively become the default persistence mechanism for the Java platform” [19]. 

We believe that there are numerous problems with using JOS as a persis- 
tence mechanism for production-quality code. These are presented in detail by 
Evans [10] and Jordan [17] and they are beyond the scope of this paper. We will 
concentrate instead on the transient keyword in the context of JOS. 



The transient Keyword and JOS For reasons similar to the ones presented 
in Section 3, a way of specifying that some fields of a class should not be serialised 
was necessary for JOS. Starting with the 1.1 release of the Java platform, the 
transient keyword has been used for this purpose. The JOS specification defines 
it as follows [32, pages 10-11]. 

Definition 3 “Transient fields are not persistent and will not be saved by any 
persistence mechanism. Marking the field will prevent the state from appearing 
in the stream and from being restored during deserialization.” 

A definition similar to the above is also included in the Java Platform 1.2 Core 
API Specification [33, class java.io.ObjectOutputStream]. 

Definition 4 “The default serialization mechanism for an object writes the class 
of the object, the class signature, and the values of all non-transient and non- 
static fields. References to other objects (except in transient or static fields) cause 
those objects to be written also. ” 

From the above two quotes, it is clear that the de-facto definition of the 
transient keyword is “not serialisable by JOS”. However, the JLS [14] does not 
cover any of the changes to the Java programming language that were introduced 
from version 1.1 onwards, and hence does not “officially” embrace this definition. 
We expect, however, that its next edition will. 
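
The de-facto behaviour described by Definitions 3 and 4 can be observed directly. The following is a minimal sketch, assuming the Point class of Definition 2 is additionally declared Serializable; the class name TransientRoundTrip and the roundTrip helper are our own illustrative names and do not appear in the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class TransientRoundTrip {

    // The Point class of Definition 2, made Serializable so that JOS accepts it.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;
        transient float rho, theta;
    }

    // Serialises an object graph into a byte array and reads a copy back.
    static Object roundTrip(Object original) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point();
        p.x = 3; p.y = 4; p.rho = 5.0f; p.theta = 0.93f;
        Point copy = (Point) roundTrip(p);
        // x and y survive the round trip; rho and theta come back as 0.0,
        // the default value for float, because they never entered the stream.
        System.out.println(copy.x + " " + copy.y + " " + copy.rho + " " + copy.theta);
    }
}
```

Note that the sketch says nothing about the open questions listed in Section 4.1; it only shows what JOS does with a transient field.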



Concrete Example: java.awt.Component The transient keyword is used 
extensively in the standard libraries of the JDK platform, in classes that can 
be serialised. The example below is from the Component class of the java.awt 
package [33]. 

package java.awt; 

public abstract class Component 
    implements java.io.Serializable, ... 
{ 
    transient ComponentPeer peer; 

} 

All AWT component classes extend this class and peer is the architecture/OS-dependent 
state of each component. It is marked transient so that it is re-created 
when the object is deserialised on another machine. This, in fact, is very 
similar to the situation illustrated in Figure 2, if we assume that object W in 
that figure is the Component object and object ES is the peer object. 

An interesting observation is that, even though a method that initialises the 
peer object exists, there is no mechanism to call it after deserialisation, as seems 
natural. Instead, every method of the Component class checks whether peer is 
null and, if it is, it then calls the initialisation method. This requirement for an 
explicit check before every use is extremely tedious, confusing, and error-prone. 
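
The check-before-use idiom can be sketched as follows. This is not the AWT source: Widget, Peer, ensurePeer and copyViaJos are hypothetical names chosen for illustration, and Peer merely stands in for platform-dependent state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class LazyPeerSketch {

    // Hypothetical stand-in for platform-dependent state such as an AWT peer.
    static class Peer {
        final String platform = System.getProperty("os.name");
    }

    static class Widget implements Serializable {
        private static final long serialVersionUID = 1L;
        final String label;
        transient Peer peer;   // not serialised; null in a deserialised copy

        Widget(String label) { this.label = label; }

        // Every method that needs the peer has to go through this check.
        private Peer ensurePeer() {
            if (peer == null) {
                peer = new Peer();   // re-create an equivalent peer on demand
            }
            return peer;
        }

        String describe() {
            return label + " on " + ensurePeer().platform;
        }
    }

    // Copies a widget by serialising and deserialising it with JOS.
    static Widget copyViaJos(Widget w) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(w);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Widget) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Widget w = new Widget("OK button");
        System.out.println(w.describe());
        Widget copy = copyViaJos(w);
        System.out.println(copy.peer == null);   // peer was not carried over
        System.out.println(copy.describe());     // peer is re-created lazily
    }
}
```

Forgetting the ensurePeer call in a single method is enough to produce a NullPointerException after deserialisation, which is why we consider the idiom error-prone.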



Concrete Example: java.util.Hashtable Another example we will consider 
is the Hashtable class of the java.util package [33]. 

package java.util; 

public class Hashtable 
    extends Dictionary 
    implements Cloneable, java.io.Serializable 
{ 
    private transient HashtableEntry table[]; 
} 

According to the above, the array that contains the hash table entries has been 
marked transient and will not be automatically serialised. At first sight, this 
indicates that the contents of a hashtable will not be saved and restored. In 
fact, it is an idiom which means that class Hashtable should provide special 
marshalling code (e.g. the writeObject and readObject methods [33]) to put 
the hashtable entries in the byte-stream and to retrieve them later. Effectively, 
the hashtable is “manually” serialised and re-created when deserialised. 

There are two reasons behind this “strange” behaviour. The first one is op- 
timisation. Assuming that hashtables are sparsely populated, serialising only 
their entries, rather than the table itself and all auxiliary objects, generates a 
smaller byte-stream. This is important since, as mentioned above, the original 
use of JOS was in RMI and smaller byte-streams are transmitted more efficiently 
over the network. For the same reasons, there are many collection classes, in the 
standard libraries of the JDK platform 1.2, that also have their principal data 
fields marked with transient. Examples include LinkedList, TreeMap, and 
TreeSet [33]. 

The second reason is that the location of each entry in the table depends on 
the entry’s hashcode. However, the JLS defines the hashCode method as follows 
[14, pages 459-460]. 

Definition 5 “Whenever it is invoked on the same object more than once during 
an execution of a Java application, hashCode must consistently return the same 
integer. The integer may be positive, negative, or zero. This integer does not, 
however, have to remain consistent from one Java application to another, or from 
one execution of an application to another execution of the same application.” 

According to the above definition, when a byte-stream is deserialised, possibly 
on a different machine, the new copies of the objects that are created might 
have different hashcodes. This forces the hashtable to be re-created, so 
that each of its entries is inserted in the correct slot, specified by its (possibly 
different) hashcode®. 
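
The “special marshalling” idiom can be illustrated with a small open-addressing set. This is a sketch under our own assumptions, not the Hashtable source: SparseSet and its methods are hypothetical, while writeObject, readObject, defaultWriteObject and defaultReadObject are the standard JOS hooks. Only the entries are written to the stream, and they are re-inserted on deserialisation using whatever hashcodes they have in the new activation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ManualMarshallingSketch {

    static class SparseSet implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int capacity;
        private transient Object[] slots;   // marked transient: marshalled by hand
        private transient int size;

        SparseSet(int capacity) {
            this.capacity = capacity;
            this.slots = new Object[capacity];
        }

        // Linear probing; the starting slot depends on the element's hashcode.
        boolean add(Object e) {
            int i = Math.floorMod(e.hashCode(), capacity);
            for (int probes = 0; probes < capacity; probes++) {
                if (slots[i] == null) { slots[i] = e; size++; return true; }
                if (slots[i].equals(e)) return false;
                i = (i + 1) % capacity;
            }
            throw new IllegalStateException("table full");
        }

        boolean contains(Object e) {
            int i = Math.floorMod(e.hashCode(), capacity);
            for (int probes = 0; probes < capacity; probes++) {
                if (slots[i] == null) return false;
                if (slots[i].equals(e)) return true;
                i = (i + 1) % capacity;
            }
            return false;
        }

        int size() { return size; }

        // Writes the entry count and the entries, not the sparse array itself.
        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();          // writes capacity only
            out.writeInt(size);
            for (Object e : slots) {
                if (e != null) out.writeObject(e);
            }
        }

        // Rebuilds the table, re-hashing every entry in the new activation.
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();            // restores capacity
            slots = new Object[capacity];
            int n = in.readInt();
            for (int k = 0; k < n; k++) {
                add(in.readObject());          // size is rebuilt by add
            }
        }
    }

    static SparseSet roundTrip(SparseSet s) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(s);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SparseSet) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SparseSet s = new SparseSet(64);
        s.add("alpha");
        s.add("beta");
        SparseSet copy = roundTrip(s);
        System.out.println(copy.size() + " " + copy.contains("alpha"));
    }
}
```

Although slots is declared transient, its contents do reach the byte-stream, which is exactly the second meaning of the keyword discussed in the summary that follows.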



Concrete Example Summary The two concrete examples above illustrate 
that the transient keyword is overloaded by JOS with the following two mean- 
ings. 

— Not Serialisable. This means that the field should not be serialised and 
should be reset to its default value when the object is deserialised. This is the 
case for the Component example. 

— Special Marshalling Required. This means that the field should not 
be serialised “as is”, but special code should be run instead to seri- 
alise/deserialise it. This is the case for the Hashtable example. 

It is worth emphasising here that the second case above directly contradicts 
Definition 3, which states that “transient fields will not be saved by any persistence 
mechanism”, since it does allow data in transient fields to be serialised; this just 
happens by a different mechanism, namely the writeObject method [33]. 



The serialPersistentFields Mechanism In addition to the transient 
keyword, a new mechanism was introduced in the JDK1.2 platform that 
can be used to specify which fields should be serialised by JOS. An array 
with name serialPersistentFields and signature static final private 
ObjectStreamField [] can be introduced in a class to specify which fields of 
that class should be serialised [32, pages 4-5] (notice that this is the inverse of 
the transient keyword, which specifies the fields that should not be serialised). 
Its entries are instances of class ObjectStreamField, each of which represents 
a field of a given name and type. The use of this mechanism is illustrated in the 
following example®. 

class Point { 
    int x, y; 
    float rho, theta; 

    static final private ObjectStreamField[] 
        serialPersistentFields = { 
            new ObjectStreamField("x", Integer.TYPE), 
            new ObjectStreamField("y", Integer.TYPE) 
        }; 
} 

° This is an example where internal state has been allowed to become external, in 
order to free implementers from constraints; erroneously in our opinion [19]. 

® The expression Integer.TYPE in the example denotes the class of the primitive type 
int. 

According to the above, when an instance of class Point is serialised, only the 
fields x and y will be written to the generated byte-stream; the fields rho and 
theta will be considered transient. In fact, for the purposes of JOS, the above 
class will behave in an identical way to the one included in Definition 2, the 
latter using transient instead of serialPersistentFields. 

We believe that the use of serialPersistentFields is unnecessarily awkward 
and verbose (e.g. compare the source of class Point above to the one 
included in Definition 2). Unfortunately, it is also unsafe because, even though 
the serialPersistentFields field itself must be declared final and hence cannot 
be reassigned, the contents of the array are not final (in fact, they cannot be, 
according to the Java programming language [14]). By assigning new instances of 
ObjectStreamField to the array slots, one can arrange for different fields to be 
serialised at different times, with confusing consequences. Finally, using 
serialPersistentFields is also error-prone. A misspelling of its name or a small 
difference in signature will cause JOS to “quietly” ignore it, creating a very 
subtle debugging problem for the developer. This, in fact, happened to us when we 
were evaluating its use; it took us over 30 minutes to spot the error. 
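
For completeness, a runnable variant of the Point example is sketched below. It assumes Point is declared Serializable; the enclosing class name and the roundTrip helper are our own. The sketch only checks which fields survive a round trip and makes no claim about what happens if the array slots are reassigned later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamField;
import java.io.Serializable;

class SerialPersistentFieldsSketch {

    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x, y;
        float rho, theta;   // not listed below, so JOS does not write them

        // Name, modifiers and type must match exactly, or JOS ignores the array.
        private static final ObjectStreamField[] serialPersistentFields = {
            new ObjectStreamField("x", Integer.TYPE),
            new ObjectStreamField("y", Integer.TYPE)
        };
    }

    static Point roundTrip(Point p) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Point) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Point p = new Point();
        p.x = 1; p.y = 2; p.rho = 2.2f; p.theta = 1.1f;
        Point copy = roundTrip(p);
        // Same observable result as declaring rho and theta transient.
        System.out.println(copy.x + " " + copy.y + " " + copy.rho + " " + copy.theta);
    }
}
```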

Even though it has shortcomings, this mechanism could ultimately eliminate 
the use of the transient keyword for the purposes of JOS. This would 
make the keyword “available” for use exclusively in the context of persistence. However, 
at least for now, this is not possible, as all but three classes in the standard 
libraries of the JDK platform 1.2 use the transient keyword in preference to 
serialPersistentFields. 



4.3 Defining Transient Fields in PJama 

In an ideal implementation of the PJama system, we would like the transient 
keyword to follow Definition 1. However, since we have built PJama on top of the 
JDK platform, we have also inherited the platform’s standard libraries, where 
transient meets the needs of JOS, as explained in Section 8. There are two 
issues that should be considered. 

— Persistent objects in PJama retain the same hashcode across multiple invocations 
of the system (this is basically an extra constraint on Definition 5). 

— Specially encoding object graphs is not desirable in PJama since it will com- 
plicate its incremental object-fetching operation, it will break the concept of 
persistence independence, and the bandwidth issue, present in a distributed 
context, is less significant for persistence. 

It follows that PJama should ignore, for the purposes of persistence, some occurrences 
of the transient keyword in the standard libraries of the JDK platform 
(in classes that it can deal with, e.g. Hashtable and other collections), but not all 
of them (in classes that contain external state that PJama cannot deal with, e.g. 
Component). One solution to this would have been to remove the inappropriate 
occurrences of transient. Unfortunately, this is not possible, since PJama uses 
JOS in the context of RMI [29] and JOS would not then operate correctly over 
such classes. 

This forced us to implement an alternative way of marking fields transient, 
which is recognised only by PJama in the context of persistence and 
does not interfere with JOS. By default, the transient keyword is ignored by 
the PJama system and the following method 

public static final void 
    markTransient(String fieldName); 

has been introduced in the PJama-specific PJSystem class [35]. This method can 
only be called inside the static initialiser of a class and notifies the PJama system 
that the static or instance field of that class with name fieldName should not 
be stored on disk^°. 

A concrete example of the use of markTransient is included below. It 
presents the Component class of the java.awt package (see Section 7), after 
it was modified for use in PJama (it is worth pointing out that, even though 
we had to modify a few more of the standard classes of the JDK platform in a 
similar way, their public API remained the same, allowing all existing, even 
non-PJama-specific, Java programs to operate unchanged). 

package java.awt; 

import org.opj.utilities.PJSystem; 

public abstract class Component 
    implements java.io.Serializable, ... 
{ 
    transient ComponentPeer peer; 
    transient Container parent; 

    static { 
        PJSystem.markTransient("peer"); 
    } 
} 

When a Component object is written to the store, the peer field, which contains 
the external state of the component, will be set to the null value, since 
markTransient has been called on it. However, the parent field will be saved 
to disk, even though it has the modifier transient (this is desirable since in 
PJama we want to make the structure of a GUI persistent and the parent object 
is part of it). Since the above example can be slightly misleading, it is worth 
pointing out that the markTransient method can be called on fields that are 
not tagged with transient, if necessary. 

^° Naturally, appropriate exceptions will be thrown if markTransient is not called inside 
a static initialiser or the field name is invalid. 

Even though the introduction of the markTransient method is not particularly 
elegant, it is functional, straightforward to use, and allows the information 
on the transience of fields to be kept inside the class source, rather than some- 
where else. It is also safe, since instances of a class can only be created after 
its static initialiser has been invoked [14]. This ensures that, when a new in- 
stance is allocated, PJama has already been notified about which fields must be 
considered as transient. 
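
The safety argument can be mimicked in plain Java. The sketch below is not the PJama implementation: TransienceRegistry is a hypothetical stand-in for the store's bookkeeping, and, unlike PJSystem.markTransient, it takes the owning class as an explicit argument and does not check that it is called from a static initialiser.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class MarkTransientSketch {

    // Hypothetical registry of fields that the store must not write to disk.
    static final class TransienceRegistry {
        private static final Map<Class<?>, Set<String>> MARKED = new HashMap<>();

        static synchronized void markTransient(Class<?> owner, String fieldName) {
            try {
                owner.getDeclaredField(fieldName);   // reject unknown field names
            } catch (NoSuchFieldException e) {
                throw new IllegalArgumentException(
                    "no field " + fieldName + " in " + owner.getName());
            }
            MARKED.computeIfAbsent(owner, k -> new HashSet<>()).add(fieldName);
        }

        // By default every field is stored, whatever its transient modifier says.
        static synchronized boolean isStored(Class<?> owner, String fieldName) {
            Set<String> marked = MARKED.get(owner);
            return marked == null || !marked.contains(fieldName);
        }
    }

    // Simplified analogue of the modified Component class shown above.
    static class Component {
        transient Object peer;
        transient Object parent;

        static {
            // Runs before the first instance can be allocated.
            TransienceRegistry.markTransient(Component.class, "peer");
        }
    }

    public static void main(String[] args) {
        new Component();   // forces class initialisation
        System.out.println(TransienceRegistry.isStored(Component.class, "peer"));
        System.out.println(TransienceRegistry.isStored(Component.class, "parent"));
    }
}
```

Because the registration sits in the static initialiser, the registry is guaranteed to be up to date before any instance of the class exists.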

It is worth emphasising that our approach directly contradicts Definition 3, 
which states that “transient fields are not persistent and will not be saved by any 
persistence mechanism”, since PJama stores transient fields on disk by default, 
unless the method markTransient has been called on them. Though clearly un- 
desirable, this is necessary because of the overloaded meaning of the transient 
keyword, as explained in Section 8. 

4.4 GemStone’s Approach 

GemStone Inc. also had to define a way to deal with the overloaded mean- 
ing of the transient keyword in their GemStone/J product [13]. The resulting 
technique is more restrictive than the one adopted by PJama, described in Section 
4.3. By default GemStone/J obeys the transient keyword. However, when 
GemStone/J’s default repository (i.e. store) is built [12], the transient attribute 
for some standard classes (e.g. Hashtable, Date) is internally cleared for 
the purposes of the persistent object manager. This facility is internal and not 
available to users^^. 

The above is the opposite of the approach taken by PJama: GemStone/J 
obeys the transient keyword, unless otherwise specified, whereas 
PJama ignores the transient keyword, unless otherwise specified. However, 
the final outcome is the same in both cases: some fields prefixed with transient 
in the standard libraries are not treated as such by the storage mechanism. 

4.5 An Alternative Proposal 

A more radical way to deal with the overloaded meaning of transient is included 
in the Orthogonal Persistence for the Java Platform Proposal [20] and was orig- 
inally proposed by GemStone Inc. It involves extending the Java programming 
language with the introduction of the sub-modifiers serial and storage, that 
augment the transient keyword. Their meaning is as follows. 

— transient serial only means “not serialisable” (i.e. only transient in the 
context of JOS). 

— transient storage only means “not persistent” (i.e. only transient in the 
context of the persistence mechanism, in our case PJama). 

— transient on its own is the combination of both of the above, i.e. “not 
serialisable” and “not persistent”. 

The meaning of transient on its own does not differ from its current inter- 
pretation in the JDK platform (see Section 8), as it encompasses both serial 
and storage. This allows for backward compatibility, since existing classes that 
use transient will still operate correctly unchanged. Only if one of the sub- 
modifiers is introduced into a class, may changes to its methods be necessary to 
reflect the change in semantics. 

A concrete example of their use is given below. It is the Component class, 
presented in Section 4.3, amended appropriately. 

package java.awt; 

public abstract class Component 
    implements java.io.Serializable, ... 
{ 
    transient ComponentPeer peer; 
    transient serial Container parent; 

    /* no markTransient is required */ 
} 

In the above, the parent field is marked transient serial, while the peer field 
remains marked transient. This allows for the correct handling of these fields 
by both PJama and JOS, eliminating the need for the markTransient method. 

It is worth emphasising that the serialPersistentFields mechanism, in 
conjunction with the transient keyword being used only for persistence, can also 
provide a way of distinguishing between transience for persistence and transience 
for JOS (see Section 8). However, we believe that the transient serial and 
transient storage sub-modifiers provide a much more compact, safer, clearer, 
and less error-prone way of doing so. 

4.6 A Note on Persistence vs. Distribution 

Earlier sections have explained why it is necessary to introduce a new way of 
marking fields transient for the purposes of persistence and how this should 
not interfere with JOS. Both PJama and GemStone Inc. have adopted ad-hoc 
solutions and the proposal for the serial and storage sub-modifiers attempts 
to introduce a consistent way of dealing with this. However, there is another 
related issue that should also be considered. 

As mentioned in Section 4.2, JOS is used for both persistence (e.g. JavaSpaces) 
and distribution (e.g. RMI) in the Java platform. However, the trade-offs 
of persistence and distribution are radically different, mainly due to differences 
in bandwidth^^. For example, it is typically more efficient to minimise the size 
of a byte-stream, generated when serialising an object graph, so that it is 
transmitted more quickly over the network. This optimisation is not really necessary in the 
context of persistence, since the local disk of a machine can usually be accessed 
very efficiently^^. 

Apart from efficiency reasons, there is also the distinction between data that 
is allowed to persist but not be transmitted and vice versa. An example of 
the former is sensitive information (e.g. a password) that the user/programmer 
would trust the local disk to store but would not trust the network to transmit. 
Examples of the latter are more subtle. However, they do exist in the internals 
of PJRMI, the PJama-specific extensions to RMI [29]. 

Given the above, we believe that any definition of transience should deal 
with the persistence/distribution separation (Evans also reaches the same con- 
clusion [10]). Ideally, in an orthogonally persistent system, JOS should only be 
used in the context of RMI and not to provide any persistence facilities. In this 
case, the proposed serial and storage sub-modifiers will be interpreted as not 
transmittable and not persistent, respectively. 



5 Handling Transient Fields in a Persistent System 

This section overviews implementation techniques for handling transient fields 
in persistent systems, while preserving the semantics of Definition 1. Section 5.1 
deals with issues raised when propagating updates to the persistent store and 
Section 5.2 covers issues concerning object eviction. 



5.1 Updating Objects 

Typically, persistent systems need to keep track of which persistent objects are 
updated (dirtied) during program execution, in order to propagate these updates 
to the persistent store. The accuracy of the dirtying information (i.e. object-grain 
or field-grain) is a trade-off between run-time performance (i.e. how fast it can 
be recorded), space requirements (i.e. how much space is required to record it), 
and stabilisation operation efficiency (i.e. how fast the system can detect which 
objects have been updated and propagate the changes to disk). 

In the current release of PJama (based on the JDK classic platform), the 
dirtying information is object-grain. Each persistent object has a dirty flag on 
its header, which is set when an update to it occurs. The stabilisation operation 
discovers which objects have their dirty flag set and updates their disk image. 
This works efficiently for small objects (i.e. most class instances), since they are 
very likely to reside on a single disk page. In this case, the I/O operation will 
always dwarf the memory copying operation from the object cache to the disk 
cache (see Figure 1). Large arrays are associated with a card table [16], which 
specifies which parts of the array are updated, in an attempt to minimise disk 
traffic. The mechanism is explained in more detail by Daynes and Atkinson [9]. 

^^ Thanks to Alex Garthwaite for this observation. 

^^ Interestingly enough, there are situations where it is more efficient to access the 
memory of another machine through a LAN rather than the local disk [11]. Such 
techniques however are still at an experimental stage and normally require special 
operating system support. But much of the traffic carried by RMI is over much 
slower WANs. 

The fact that persistent objects are updated in their entirety, rather than on 
a field-by-field basis, introduces a potential problem for objects which contain 
transient fields. Such fields have to be set to a default value, when they are 
written to the store, but their in-memory value has to be retained (as specified 
by Definition 1). An obvious approach to propagating updates to the store is to 
translate each object into store format in-place in the object cache (this involves 
translating all the references in it to PIDs and resetting the transient fields; 
this operation is called unswizzling), copy these modified contents to the disk 
cache, and translate the original object image back to VM format. However, 
this actually discards the values of the transient fields, as they are overwritten 
with the default values, and it does not allow them to be restored at the end of 
the stabilisation operation, as required by Definition 1^^. Even though auxiliary 
data structures could be temporarily set up to hold the overwritten values, this 
would complicate the implementation and require extra space to be reserved. 

A more elegant solution is given here. Notice that for an object to be updated 
on disk, its image has to be copied from the object cache to the disk cache. This 
presents an opportunity to perform all the necessary translations in the disk- 
cache buffer, while leaving the original image of the object unchanged. This way, 
the transient fields will only be overwritten with the appropriate default values 
on the disk-cache buffer and not in the object cache, eliminating the problem 
mentioned above. 

In order for this to be possible, a well-defined interface is needed 
between the persistent heap manager and the persistent store that 
allows the former to access the disk cache. Such an interface has been introduced 
in Sphere, the new persistent object store from Glasgow University that will be 
used in the next version of PJama [26,27,25], in the form of the Unswizzling 
Callbacks [23]. Every time an object-update call is invoked, Sphere performs 
an up-call (the unswizzling callback) to the persistent heap manager, providing 
it with the address of the object image in the disk-cache buffer, so that the 
necessary translations can be performed. 

The operation of the unswizzling callbacks is explained in more detail by 
Printezis and Atkinson [24]. Apart from handling transient fields, they can also 
be used to unswizzle references directly on the disk cache. This avoids the 
unnecessary in-place translation of each object mentioned above. 

^^ The same problem does not apply to the unswizzled references, since there is a 
one-to-one mapping between the address of a persistent object and its PID, therefore 
translating between them is always possible. 

Their use will be illustrated using a concrete example. Figure 3 shows the 
initial state of the system. Object A is a persistent object, cached in memory, 
with its own PID (for the purposes of this example, the PID of each cached persistent 
object is stored in its header). It contains two references: one to object B, which 
is also persistent and has PID P_B, and a transient one (hence the dashed line) 
to object C. 

The first step in updating object A is illustrated in Figure 4. The object’s 
image is copied to the disk-cache buffer “as is”, with its reference fields still 
containing memory addresses. Then the unswizzling callback is invoked, accepting 
as a parameter the address of the object on the disk-cache buffer. 

Figure 5 illustrates the state of the system after the unswizzling callback 
has finished its operation. The reference to object B has been replaced by the 
object’s PID (i.e. P_B) and the reference to object C has been set to null, as 
it is transient. Finally, notice that during this entire operation, the in-memory 
image of object A remained unchanged. 
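
The two steps of the example can be modelled in a few lines of Java. This is a deliberately simplified sketch and not the Sphere interface: an object image is modelled as an array of reference slots, PIDs as Long values, and the names Store, HeapManager and UnswizzlingCallback are our own illustrative choices.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class UnswizzleSketch {

    // Up-call from the store to the heap manager (hypothetical signature).
    interface UnswizzlingCallback {
        void unswizzle(Object[] bufferImage);
    }

    static final class Store {
        // Copies the image "as is" into a buffer, then lets the heap manager
        // translate the copy; the in-memory image is never written to.
        Object[] update(Object[] objectImage, UnswizzlingCallback callback) {
            Object[] buffer = objectImage.clone();
            callback.unswizzle(buffer);
            return buffer;
        }
    }

    static final class HeapManager implements UnswizzlingCallback {
        final Map<Object, Long> pidOf = new IdentityHashMap<>();
        private final boolean[] transientSlot;

        HeapManager(boolean[] transientSlot) {
            this.transientSlot = transientSlot;
        }

        @Override
        public void unswizzle(Object[] bufferImage) {
            for (int i = 0; i < bufferImage.length; i++) {
                if (transientSlot[i]) {
                    bufferImage[i] = null;               // default value on disk
                } else if (bufferImage[i] != null) {
                    Long pid = pidOf.get(bufferImage[i]);
                    if (pid == null) {
                        throw new IllegalStateException("referent is not persistent");
                    }
                    bufferImage[i] = pid;                // address replaced by PID
                }
            }
        }
    }

    public static void main(String[] args) {
        Object b = new Object();                 // persistent object B
        Object c = new Object();                 // target of the transient reference
        Object[] imageOfA = { b, c };
        HeapManager heap = new HeapManager(new boolean[] { false, true });
        heap.pidOf.put(b, 42L);                  // stands in for P_B

        Object[] onDisk = new Store().update(imageOfA, heap);
        System.out.println(onDisk[0] + " " + onDisk[1]);
        System.out.println(imageOfA[0] == b && imageOfA[1] == c);
    }
}
```

Because all translation happens in the buffer, no auxiliary structure is needed to remember the overwritten transient values.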

The current PJama system is using a scheme very similar to this to unswizzle 
objects. However, this has been achieved without the use of proper interfaces, 
which resulted in the abstractions within the system collapsing. We believe that 
the introduction of the unswizzling callbacks provides an elegant solution to this 
problem, while keeping the two components (i.e. the object cache and the store 
layer) well separated. 

Finally, it is worth pointing out that this implementation technique is not 
only applicable to systems with a one-to-one tight coupling between the virtual 
machine and the store (like PJama, as illustrated in Figure 1), but also to client- 
server architectures. In this case, on the client side, the unswizzling can take 
place on the buffer that will be used to transfer the data to the server. 

5.2 Evicting Objects 

In the context of PJama, Object Eviction is the operation that removes persistent 
objects from the object cache (see Figure 1), in order to make space for others 
to be faulted in^®. It is implemented entirely inside the memory management 
of the customised interpreter and is transparent to the programmer. Because of 
this, the eviction mechanism should not violate Definition 1. 

However, an evicted object can be faulted in by either the same invocation 
of the system that evicted it or a different one at a later time. This definition 
raises some interesting issues when considering eviction of objects that contain 
transient fields. In particular, when such an object is evicted, it is unclear what 
values the transient fields in its disk image should be set to. 

— If they are set to the default values, as is natural, and the object is re-fetched 
within the same invocation of the system, the transient fields will be 
reset. This clashes with Definition 1. 

— If they are set to the values that they contained in memory, when the object 
was evicted, and the object is fetched in a different invocation of the system, 
the value of the fields will most likely be invalid. This is especially severe if 
the transient fields happen to contain memory addresses, as it will introduce 
“dangling” references. 

^® It should not be confused with the page eviction operation of a virtual memory 
system. Such pages can only be evicted and fetched by the same invocation of a 
process and are discarded after that process has finished execution. 

One solution to the above problem is to always pin in the object cache all per- 
sistent objects containing transient fields and not allow them to be evicted. This 
should work well since, in theory, only a small portion of objects have transient 
fields. However, it might pose scalability problems for some applications. The 
obvious alternative of storing the values of the transient fields in auxiliary data 
structures, before the object is evicted, can introduce scalability problems as 
well. 

A refinement is to evict an object, only when its transient fields (if any) 
already contain the default values. This will mean that either the object has not 
been used or any external resources, which it pointed to, have been explicitly 
disposed of by the application and its transient fields have been reset. In either 
case, it is safe to evict the object, since its state will be valid, if it is faulted in 
by the same or a different invocation of the system. 
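
The eviction test suggested by this refinement can be sketched with reflection. The code below is an illustration under stated assumptions rather than PJama's mechanism, which lives inside the interpreter: it uses the transient modifier as a stand-in for fields registered through markTransient, it is meant for application classes only, and the names isEvictable and Session are hypothetical.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

class EvictionCheckSketch {

    // Returns true if every transient instance field of o holds its default value.
    static boolean isEvictable(Object o) throws IllegalAccessException {
        for (Class<?> c = o.getClass(); c != null; c = c.getSuperclass()) {
            for (Field f : c.getDeclaredFields()) {
                int mods = f.getModifiers();
                if (!Modifier.isTransient(mods) || Modifier.isStatic(mods)) {
                    continue;
                }
                f.setAccessible(true);
                if (!isDefault(f.getType(), f.get(o))) {
                    return false;   // live transient state: keep the object pinned
                }
            }
        }
        return true;
    }

    // Default values: null for references, false, '\0' or zero for primitives.
    private static boolean isDefault(Class<?> type, Object value) {
        if (!type.isPrimitive()) {
            return value == null;
        }
        if (value instanceof Boolean) {
            return !((Boolean) value).booleanValue();
        }
        if (value instanceof Character) {
            return ((Character) value).charValue() == '\0';
        }
        return ((Number) value).doubleValue() == 0.0;
    }

    // Hypothetical application class that holds an external resource.
    static class Session {
        String user;
        transient Object connection;
        transient int openCursors;
    }

    public static void main(String[] args) throws Exception {
        Session s = new Session();
        s.user = "alice";
        System.out.println(isEvictable(s));   // no transient state yet
        s.connection = new Object();
        System.out.println(isEvictable(s));   // external resource still held
        s.connection = null;                  // application disposed of it
        System.out.println(isEvictable(s));
    }
}
```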

6 Conclusions 

This paper raises several issues concerning the interpretation and use of the 
transient keyword in the Java programming language, in the Java platform 
standard libraries, and in an orthogonally persistent environment such as PJama. 
We hope that, using concrete examples from the JDK platform and PJama li- 
braries, we have convincingly shown that 

— the transient keyword has not been defined sufficiently accurately in the 
JLS, 

— its current usage in the standard libraries of the JDK platform is inconsistent 
with the existing JLS definition, 

— its currently overloaded meaning requires persistence mechanisms to adopt 
ad-hoc workarounds, 

— the fundamental distinction between its use in a persistent and distributed 
context is not addressed, and 

— it is feasible to implement efficiently the handling of transient fields, while 
obeying Definition 1. 

We have proposed a definition, Definition 1, that clarifies the semantics of 
transient in a way which we believe appropriate for providing persistence to 
Java and other languages and have shown that this is practically feasible. We 
would like to see some of the issues presented in this paper included in future 
versions of the JLS and propagated to the standard libraries of the JDK plat- 
form. 

It is currently unclear whether the serialPersistentFields mechanism 
will eventually be adopted by the whole JOS community. The wide use of the 
transient keyword, in the context of JOS, and the requirements for backward 
compatibility, might inhibit this. In this case, the proposal for the introduction 
of the serial and storage sub-modifiers would be a compromise that yields 
a more elegant solution. However, if we could change history, we would have 
adopted the following three keywords instead: 

— transient to mean not storable, 

— local to mean not transmittable, and 

— special to mean “special marshalling required”. 

We believe that future programming languages should include persistence and 
that this requires more care in the definition of transience. The management of 
the interplay between persistence and transience becomes important as we move 
towards open systems. 

Abstract. A central problem in workflow concerns optimizing the distribution of work in 
a workflow: how should the execution of tasks and the management of tasks be distributed 
across multiple processing nodes (i.e., computers)? In some cases task management or 
execution may be at a processing node with limited functionality, and so it is useful to optimize 
translations of (sub-)workflow schemas into flowcharts that can be executed in a restricted 
environment, e.g., in a scripting language or using a flowchart-based workflow engine. 

This paper presents a framework for optimizing the physical distribution of workflow 
schemas, and the mapping of sub-workflow schemas into flowcharts. We provide a general 
model for representing essentially any distribution of a workflow schema, and for represent- 
ing a broad variety of execution strategies. The model is based on families of “communicat- 
ing flowcharts” (CFs). In the framework, a workflow schema is first rewritten as a family of 
CFs that are essentially atomic and execute in parallel. The CFs can be grouped into “clus- 
ters”. Several CFs can be combined to form a single CF, which is useful when executing a 
sub-schema on a limited processor. Local rewriting rules are used to specify equivalence- 
preserving transformations. We developed a set of formulas to quantify the metrics used for 
choosing a near-optimal set of CF clusters for executing a workflow. The current paper focuses 
primarily on ECA-based workflow models, such as Flowmark, Meteor and Mentor, and 
condition-action based workflow models, such as ThinkSheet and Vortex. 

1 Introduction 

A workflow management system provides a framework and mechanism for orga- 
nizing the execution of multiple tasks, typically in support of a business or sci- 
entific process. A variety of workflow models and implementation strategies have 
been proposed [GHS95,WfM99]. Several recent projects have focused on devel- 
oping architectures and systems that support distributed execution of workflows 
[AMG+95,DKM+96,WW97,BMR96]. A central problem concerns how to optimally 
distribute a workflow, i.e., how should the management and execution of tasks 
be distributed across multiple processing nodes? For example, while communication 
costs may increase with distribution, execution of task management on the same 
node that executes the tasks may reduce communication costs and overall execution 
time. In some cases, the processing node for executing a sub-workflow may 
have limited functionality. For this reason, it is useful to optimize translations of 
some sub-schemas into flowcharts that can be executed in a restricted environment, 
e.g., in a scripting language or using a flowchart-based workflow engine. This paper 
presents a framework and techniques for optimizing distributed execution of workflows 
that includes the physical distribution of workflow schemas, and the mapping 
of sub-workflow schemas into flowcharts. 

The framework developed is similar to the framework used in relational database 
query optimization. In particular, it provides an abstract model for representing “log- 
ical execution plans” for workflow schemas. These plans give a partitioning of a 
workflow schema, indicate data and control flow dependencies between the par- 
titions, and also indicate how the workflow sub-schemas should be implemented. 
Analogous to the relational algebra, this model is not intended for end-users or 
workflow designers. Rather it captures relevant features concerning distribution and 
execution of many workflow models, including those based on flowcharts, on Petri 
nets (e.g., ICN [Ell79]), on an event-condition-action (ECA) paradigm (e.g., Flowmark 
[LR94], Meteor [KS95], Mentor [WWWD96]), and on a condition-action 
(CA) paradigm (e.g., ThinkSheet [PYLS96], Vortex [HLS+99a]). Furthermore, the 
model is closed under a variety of equivalence-preserving transformations and re- 
writing rules. This permits the exploration of a broad space of possible implemen- 
tation strategies for a given workflow schema. 

The abstract model is based on families of “communicating flowcharts” (CFs); 
these are flowcharts with specialized mechanisms to support inbound and outbound 
data flow, and to track control flow information. In the framework, a workflow 
schema is first rewritten as a family of CFs which are essentially atomic. These can 
be viewed as executing in parallel, while satisfying the synchronization constraints 
implied by the original workflow schema. The CFs can be grouped into “clusters”; 
in one approach to implementation each cluster is executed on a separate process- 
ing node, and data can be shared with minimal cost between the CFs in a cluster. 
These parts of the framework are useful in studying the costs and benefits of dif- 
ferent distributions of the execution of a workflow schema, assuming that each of 
the processing nodes provides a workflow engine that supports parallel execution of 
workflow tasks within a single workflow instance. 

In some applications, it may be appropriate to execute a sub-workflow on a 
processing node that does not support a general-purpose workflow engine. One mo- 
tivation for this is to put task management on the same platform as task executions, 
e.g., if several related tasks are executed on the same platform. In such cases, the 
processing node might be a legacy system that does not support full-fledged parallel 
workflow execution. Indeed, the processing node might be part of a very limited 
legacy system, such as a component of a data or telecommunications network. A 
sub-workflow might be executed on such systems by translating it into a script, 
which is structured as a flowchart and invokes tasks in a synchronous fashion. 

In other applications, it may be desirable to execute a sub-workflow on a re- 
stricted, flowchart-based workflow engine, in order to take advantage of existing in- 
terfaces to back-end components, rather than installing a general-purpose workflow 
engine and building new interfaces. (The customer-care focused workflow system 
Mosaix [Mos] is one such example.) In the abstract model presented in the current 
paper, several communicating flowcharts can be composed to form a single communicating 
flowchart; these are suitable for execution on limited processing nodes. 

To summarize, there are three main components in the abstract model: 

(a) The use of one or more flowcharts, executing in parallel, to specify the internal 
operation of a workflow schema; 

(b) A coherent mechanism for describing how these flowcharts communicate with 
each other, including both synchronization and data flow; and 

(c) An approach that permits data flow and control flow dependencies to be treated 
in a uniform manner. 

In addition, the paper presents a first family of rewriting rules that can express a 
broad family of transformations on logical execution plans. 

So far we have discussed logical execution plans, which are high-level repre- 
sentations of how a workflow can be distributed. Our framework also includes the 
notion of “physical execution plan”, which incorporates information about what 
processing nodes different clusters will be executed on, what networks will be con- 
necting them, and how synchronization and data flow will be handled. We have also 
developed formulas for computing the costs of different physical execution plans, 
based on the key factors of response time and throughput. 

The key difference between our framework and that for query optimization is 
that our framework is based on a language for coordinating tasks, rather than a 
language for manipulating data. As a result, the major cost factors and optimization 
techniques are different, and related to those found in distributed computing and 
code re-writing in compilers. 

The focus of this paper is on the presentation of a novel framework for optimiz- 
ing the distribution and execution of workflow schemas. Due to space limitations, 
the paper does not explore in depth the important issue of mechanisms to restrict or 
otherwise help in exploring the space of possible implementations. The techniques 
used in query optimization may be an appropriate starting point on this problem. 

In this paper we focus on compile-time analysis of workflow schemas, and map- 
ping of parallel workflows into flowcharts. However, we expect that the framework 
and principles explored here are also relevant to other situations. For example, de- 
cisions concerning the distribution of work for an individual instance of a workflow 
schema could be performed during runtime. Specifically, at different stages of the 
execution of the instance, the remaining portions of a workflow schema could be 
partitioned and distributed to different processors. Further, if required by some of 
the processors, selected sub-workflows could be translated into flowcharts. The spe- 
cific partitioning, and the optimal flowcharts, will typically be dependent on the data 
obtained from the processing of the workflow instance that has already occurred. 

Related Work: The area of optimization of workflow executions is a relatively new 
field, and few papers have been written on the topic. Distribution is clearly advantageous 
to support scalability, which is becoming a serious issue as more and 
more users access popular web interfaces for digital libraries, online shopping, 
etc. As noted above, several projects have developed distributed workflow systems 
[AMG+95,DKM+96,WW97,BMR96], but to our knowledge the literature has not 
addressed the issue of optimal distributions of workflows. 

A family of run-time optimizations on centralized workflows is introduced in 
[HLS+99b,HKL+99], and studied in connection with the Vortex workflow model. 
One kind of optimization is to determine that certain tasks are unneeded for successful 
execution of a workflow instance. This is especially useful in contexts where 
the workflow is focused on accomplishing specified “target” activities, and where 
intermediate activities can be omitted if not needed. Another kind of optimization, 
useful when parallel task execution is supported, is to support eager parallel 
execution of some tasks in order to reduce overall response time. References 
[HLS+99b,HKL+99] explore how these and other optimizations can be supported 
at runtime. The current paper provides a complementary approach, focusing on the 
use of these and related ideas for optimizations used primarily at compile-time. 

Our model of CFs is closely related to that of communicating sequential pro- 
cesses (CSP) with two main differences: (1) in our model, a set of flowcharts (pro- 
cesses) starts at the beginning and they will not spawn new processes, and (2) we 
carefully distinguish between communications of getting an attribute value instan- 
taneously and that of getting the value only after it has been defined. 

Organization: §2 presents some motivating examples. §3 introduces a formal model 
of CFs, and describes approaches for implementing the model. §4 presents a phys- 
ical model for distributed workflows and analyzes key cost factors for them. §5 
presents a representative family of rules and transformations on clusters of CFs. §6 
offers brief conclusions. 

2 Motivating Examples 

We provide a brief introduction to the framework for workflow distribution, includ- 
ing: (a) the abstract model used to represent distributed workflow schemas; (b) the 
basic approach to partitioning workflow schemas across different nodes; and (c) 
techniques for creating and manipulating flowcharts which can execute workflow 
subschemas on limited processing nodes. §3 gives a formal presentation of the ab- 
stract model, and §3 and §4 discuss different options for implementing the model. 

Fig. 1 shows a workflow schema with 5 tasks (shown as rectangles). We assume 
that each task has the form $A = f(A_1, \ldots, A_n)$, where $f$ is a function call 
that returns a value for attribute $A$, and may have side-effects. As is typical with 
many workflow models, there are two activities with each task: task management 
and task execution. In a centralized WFMS all tasks (and associated data 
management) are managed by one node, and the task executions are carried out by 
one or more other nodes. In our framework, the task management and execution 
may be performed by the same or different nodes. We generally assume in this paper 
that the data management associated with a task is performed by the node that 
is performing the task management, but this assumption can be relaxed. 

In workflow specifications, the attribute names have global scope, and so data 
flow is expressed implicitly, rather than using explicit data channels. This follows 
ThinkSheet and Vortex, and contrasts with the syntax of models such as FlowMark 
or Meteor. The use of global attributes or explicit data channels does not affect the 
expressive power of the models. 

Fig. 1. Example ECA workflow schema 

Fig. 2. Decomposition of schema into flowcharts, and partition onto two processing nodes (panels (a) and (b)) 

In Fig. 1 the edges indicate control flow. Except for A1, the conditions include 
both events and predicates. In this example the events have the simple form 
“Done(A)”, but other kinds of events can be incorporated. Data flow in this schema 
is indicated by how attributes are used in tasks; e.g., attribute A4 is used for the task 
of computing A5. The semantics of this workflow schema follows the spirit of ECA-style 
workflow models, such as Flowmark, Meteor, and Mentor. A task (rectangle node) 
should be executed if its enabling condition is true, i.e., if the event in that condition 
is raised, and if subsequently the propositional part of the condition is satisfied. 

We now use Fig. 2 to introduce key elements of the abstract model and our 
framework. First, focus on the 5 “atomic” flowcharts (ignore the two large boxes). 
Each corresponds to a single task in the schema of Fig. 1. These are communicating 
flowcharts (CFs). In principle, a parallel execution of these flowcharts is equivalent 
to an execution of the original workflow schema using a generic, parallel workflow 
engine. When these flowcharts are executed, attributes will be assigned an actual 
value if the corresponding enabling condition is true, and will be assigned the null 
value ω (for “disabled”) if the condition is evaluated and found to be false. 

The abstract model supports two different perspectives on attributes, which correspond 
to how attributes are evaluated in ECA models (e.g., FlowMark, Meteor, 
Mentor) vs. in condition-action models (e.g., ThinkSheet, Vortex). In ECA models, 
an attribute is read “immediately” when its value is needed; for this we use read(A) 
operations. In CA models an attribute is read only after the attribute has been initialized, 
either by executing a task and receiving an actual value, or by determining 
that the attribute is disabled and receiving value ω. We use the operation get(A) to 
indicate that the value of A is to be read only after it is initialized. 

In addition to permitting support for different kinds of workflow models, the use 
of two perspectives on attribute reads permits us to treat events as a special kind of 
attribute. In Fig. 2 the events are represented as attributes E_i, using the get semantics; 
an attribute E is considered to remain uninitialized until some flowchart gives 
it a value. (The launching of a flowchart for a given workflow is not represented 
explicitly in the abstract model, and different implementations of this are possible.) 

Fig. 3. Two flowcharts, (a) and (b), each equivalent to the three flowcharts of Fig. 2(b) 

In Fig. 2, each large square denotes a cluster of atomic flowcharts. Each cluster 
could be executed on a different node. The cluster includes a listing of attributes 
that are imported by the cluster, exported by it, or used internally (i.e., shared by the 
CFs within the cluster). We typically assume that each node has a data repository 
that maintains this data, which includes asynchronously receiving import data as it 
becomes available, and transmitting export data to the appropriate places. 

Fig. 2 shows one execution plan for the initial workflow schema of Fig. 1, where 
tasks A1 and A2 are executing on one node, and tasks A3, A4, and A5 are executing 
on another node. Another execution plan is formed by moving the flowchart for task 
A3 from the second node to the first (and adjusting the schemas of the data repositories 
accordingly). What is the difference in the costs of these two execution plans 
(call them P1 and P2)? There are various factors, including communication costs 
between the two clusters (an important issue here is the relative sizes of expected 
values for A2 and A4), and between the clusters and the location of task executions. 
Another factor, which may arise if one of the clusters is on a limited node, concerns 
the overall CPU and data management load that the cluster imposes on that node. 

Fig. 3 shows two ways that the three atomic CFs of Fig. 2(b) can be combined 
to form larger CFs. Fig. 3(a) shows one way to combine the three CFs into a single 
flowchart. This corresponds to a topological sort of the sub-schema of Fig. 1 involving 
A3, A4, A5. An analogous flowchart could be constructed using the order A4, 
A3, A5. If tasks A3 and/or A4 have a side-effect, then executing in one or the other 
order may be preferable, e.g., to have the side-effect occur as quickly as possible. 
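
The choice among sequential orderings can be illustrated with a small sketch. It enumerates every ordering of a sub-schema that respects a dependency map. The map used here (A5 waiting for both A3 and A4) is an assumption made for illustration and is not read off Fig. 1; the function name is likewise hypothetical.

```python
from itertools import permutations

# Hypothetical dependency map for the sub-schema {A3, A4, A5}: each key lists
# the tasks that must complete before it. These edges are an assumption made
# for illustration, not taken from Fig. 1.
DEPENDS_ON = {"A3": set(), "A4": set(), "A5": {"A3", "A4"}}


def sequential_orders(depends_on):
    """Return every ordering of the tasks that respects the dependencies."""
    tasks = sorted(depends_on)
    valid = []
    for order in permutations(tasks):
        position = {task: i for i, task in enumerate(order)}
        if all(position[pre] < position[task]
               for task in tasks for pre in depends_on[task]):
            valid.append(list(order))
    return valid


orders = sequential_orders(DEPENDS_ON)
```

Under the assumed dependencies, `orders` holds exactly the two sequences discussed above; an optimizer could then pick between them, for example by preferring the one that runs a task with side-effects first. Brute-force enumeration is only reasonable for very small sub-schemas.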

Assume now that in the initial workflow tasks A1, A3, and A5 are “target” tasks, 
but that the other two are for internal purposes only. In this case it would be desirable, 
for a given workflow instance, to omit execution of A4 if it is not needed for 
successful completion of the target tasks for that instance. Fig. 3(b) shows a CF that 
is equivalent to that of Fig. 3(a). (§5 describes how the first CF can be transformed 
into the second one using a sequence of equivalence-preserving rewrite rules.) In 
particular, in Fig. 3(b) task A4 is not executed if task A5 will not be executed, i.e., 
if its enabling condition (the conjunction of event E4 with a predicate whose 
arguments include A3) turns out to be false. 

Our abstract framework permits a more flexible semantics of workflow execu- 
tion than found in some ECA models, such as FlowMark or Meteor. In those mod- 
els, when an event fires the associated conditions should be queued immediately 
and tested soon thereafter, and likewise if the condition is true then the associated 
task should be queued immediately and launched soon thereafter. In our model, we 
generally require only that conditions are tested sometime after the associated event, 
and tasks launched sometime after the associated condition comes true. More strin- 
gent timing requirements can be incorporated, if appropriate for some application. 

3 A Model of Communicating Flowchart Clusters (CFC) 

This section describes a formal model of communicating flowchart clusters for rep- 
resenting logical execution plans of tasks in workflow schemas, and discusses some 
implementation issues. 

We assume the existence of attributes (and domains). A task is an expression 
$A = f(A_1, \ldots, A_n)$ where $f$ is an (uninterpreted) function (procedure) name, the $A_i$'s 
are input attributes and $A$ is an output attribute (defined by the task). In general we 
treat tasks as black boxes, i.e., the functions are uninterpreted. However, we note 
that a task may have “side-effects”. 

A condition over attributes $A_1, \ldots, A_k$ is a Boolean combination of predicates 
involving $A_1, \ldots, A_k$ and constants in their respective domains; the attributes used 
in the condition are also called the input attributes to the condition. 

Definition. A flowchart is a tuple $f = (T, C, E, s, L, I, O)$ where (1) $T$ is a set 
of (attribute or task) nodes; (2) $C$ is a set of (condition) nodes disjoint from $T$; (3) 
$s \in T \cup C$ is the entry node; (4) $E \subseteq (T \cup C) \times (T \cup C)$ such that (a) each $t \in T$ 
has at most one outgoing edge, (b) each $c \in C$ has at most two outgoing edges, and 
(c) every $x \in (T \cup C) - \{s\}$ is reachable from $s$; (5) $L$ maps each task node to a 
task and each condition node to a condition; (6) $I$ is a set of import attributes used 
by some tasks in $f$; and (7) $O$ is a set of export attributes defined in $f$. Moreover, the 
flowchart $f$ is trivial if $T \cup C = \{s\}$ and $E$ is empty. 
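
The structural requirements in this definition can be checked mechanically. The following is a minimal sketch, assuming nodes are represented as strings and edges as pairs; the class name `Flowchart` and its method names are hypothetical and not part of the paper's formalism.

```python
from dataclasses import dataclass, field


@dataclass
class Flowchart:
    """Plain-Python rendering of the tuple (T, C, E, s, L, I, O)."""
    T: set            # task (attribute) nodes
    C: set            # condition nodes, disjoint from T
    E: set            # directed edges, as (source, target) pairs
    s: str            # entry node
    L: dict = field(default_factory=dict)   # node -> task or condition label
    I: set = field(default_factory=set)     # import attributes
    O: set = field(default_factory=set)     # export attributes

    def violations(self):
        """List which structural requirements of the definition fail."""
        nodes = self.T | self.C
        problems = []
        if self.T & self.C:
            problems.append("T and C overlap")
        if self.s not in nodes:
            problems.append("entry node is not in T or C")
        if any(u not in nodes or v not in nodes for u, v in self.E):
            problems.append("edge endpoint outside T and C")
        out_degree = {n: 0 for n in nodes}
        for u, _ in self.E:
            if u in out_degree:
                out_degree[u] += 1
        if any(out_degree[t] > 1 for t in self.T):
            problems.append("task node with more than one outgoing edge")
        if any(out_degree[c] > 2 for c in self.C):
            problems.append("condition node with more than two outgoing edges")
        # Depth-first search from s to test reachability of every node.
        seen, stack = set(), [self.s]
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            stack.extend(v for u, v in self.E if u == n)
        if nodes - seen:
            problems.append("node unreachable from entry")
        return problems

    def is_trivial(self):
        """A flowchart is trivial if it has a single node and no edges."""
        return self.T | self.C == {self.s} and not self.E
```

An empty list from `violations()` means requirements (1) to (4) hold; the sketch does not check that `I` and `O` agree with the tasks in `L`, since tasks are treated as black boxes.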

Let $f = (T, C, E, s, L, I, O)$ be a flowchart. We denote the import and export attributes 
of $f$ as $in(f) = I$ and $out(f) = O$. In this paper we focus on “acyclic” 
flowcharts: a flowchart $(T, C, E, s, L, I, O)$ is acyclic if the graph $(T \cup C, E \cup D)$ 
has no cycles, where $D = \{(u, v) \mid u \text{ defines an attribute used by } v\}$. 

Example 1. Fig. 2(a) shows two flowcharts. The one on the left-hand side can be 
described as $f_1 = (T, C, E, c_2, L, \emptyset, \{A_1, E_1, E_2\})$, where $T = \{a_1, \ldots, a_4\}$, $C = \{c_2\}$, 
$a_1, a_2$ (resp. $a_3, a_4$) are two nodes defining attribute $A_2$ (resp. $E_1, E_2$), $c_2$ is 
the entry condition node, and $E = \{(c_2, a_1), (c_2, a_2), (a_1, a_3), (a_2, a_3), (a_3, a_4)\}$. 
In fact, $f_1$ is acyclic. 
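
As a sanity check, the acyclicity claim can be tested on the edge set of this example. The sketch below uses Kahn's algorithm, which is one standard way to test for cycles and is not claimed to be the authors' method. The example does not list the data-dependency edges D, so an empty D is assumed.

```python
from collections import defaultdict, deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: the graph is acyclic iff every node can be removed."""
    indegree = {n: 0 for n in nodes}
    successors = defaultdict(list)
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while ready:
        n = ready.popleft()
        removed += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                ready.append(m)
    return removed == len(nodes)


# Nodes and control-flow edges E of the flowchart in Example 1.
T = {"a1", "a2", "a3", "a4"}
C = {"c2"}
E = {("c2", "a1"), ("c2", "a2"), ("a1", "a3"), ("a2", "a3"), ("a3", "a4")}
# D holds the data-dependency edges (u defines an attribute used by v).
# The example does not list them, so an empty D is assumed here.
D = set()

example_is_acyclic = is_acyclic(T | C, E | D)
```

With a non-empty D the same call applies unchanged, since the test is made on the union of E and D.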

Semantics (or execution) of a flowchart is defined in the straightforward way 
with the exception of acquiring values for import attributes. Initially all attributes 
have the null value ⊥ (which stands for “uninitialized”). One method of acquiring a value 
for an import attribute A, called immediate read and denoted as read(A), is to retrieve 
the current value of the attribute, regardless of whether A has a proper value 
or the null value. This method is used in many workflow systems such as FlowMark 
and Meteor. Immediate read is however sometimes undesirable because the timing 
of tasks and external events may cause delays in computing attribute values, which 
may in turn cause nondeterministic behavior of the workflow system. 

In acyclic flowcharts, a task may be executed at most once. This allows an alternative 
method, called proper read, which does the following. When a value for 
an import attribute A is requested, if A has a proper value (non-⊥), the value is 
fetched; if A currently has the value ⊥, the task requesting A waits until A is assigned 
a proper value, and then the new value is fetched. We denote this operation as 
get(A). This operation is used in the Vortex paradigm, which provides a declarative 
workflow specification language. 
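
The difference between immediate read and proper read can be sketched with a small thread-safe store. This is an illustration only: the class `AttributeStore`, the sentinel objects, and the `timeout` parameter are hypothetical and do not describe how any of the cited systems are implemented. The sketch treats the "disabled" null value as an initialized value, so a proper read returns it rather than waiting.

```python
import threading

UNINITIALIZED = object()   # stands in for the null value "uninitialized"
DISABLED = object()        # stands in for the null value "disabled"


class AttributeStore:
    """Hypothetical shared store offering both kinds of attribute access."""

    def __init__(self):
        self._values = {}
        self._changed = threading.Condition()

    def write(self, name, value):
        """Assign a value and wake up any task waiting in get()."""
        with self._changed:
            self._values[name] = value
            self._changed.notify_all()

    def read(self, name):
        """Immediate read: return whatever is there, even if uninitialized."""
        with self._changed:
            return self._values.get(name, UNINITIALIZED)

    def get(self, name, timeout=None):
        """Proper read: block until the attribute has been initialized."""
        with self._changed:
            ready = self._changed.wait_for(
                lambda: self._values.get(name, UNINITIALIZED) is not UNINITIALIZED,
                timeout=timeout,
            )
            if not ready:
                raise TimeoutError(f"attribute {name!r} was never initialized")
            return self._values[name]
```

A `get` with no writer blocks forever when `timeout` is `None`, which mirrors the stalling behavior that proper reads can cause; the optional timeout is a convenience added for the sketch.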

From now on, we assume that each input attribute of every task must be specified 
with either a read or get operation. Although get operations provide a clean interac- 
tion between tasks, they may cause the execution of CFCs to “stall”. For instance, 
if there are no tasks defining A prior to a get(A) operation in a single flowchart, the 
get operation will be blocked forever. We assume that the flowcharts will never stall. 

Proper read operations alone do not guarantee determinism. A flowchart f = 
(T, C, E, s, L, I, O) is said to be write-once if in each execution of f, each attribute 
will be assigned a non-⊥ value at most once. It can be verified that every write-once 
flowchart with only proper read operations for all import attributes has a determin- 
istic behavior. The complexity of checking the write-once property of flowcharts 
depends on the condition language. 

Definition. A communicating flowchart cluster (CFC) is a quadruple $(F, I, S, O)$ 
where $F$ is a set of flowcharts with pairwise disjoint sets of nodes, $I$ (resp. $O$) is a 
set of import (resp. export) attributes for flowcharts in $F$ such that $I \subseteq \bigcup_{f \in F} in(f)$ 
(resp. $O \subseteq \bigcup_{f \in F} out(f)$), and $S = (\bigcup_{f \in F} in(f)) \cap (\bigcup_{f \in F} out(f))$ is a set of attributes defined 
and used by different flowcharts in $F$. 
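
The sets in this definition are easy to compute once each flowchart's import and export attributes are known. The sketch below assumes each flowchart is summarized as a pair of sets; the function names and the sample cluster (including which attributes each flowchart imports) are invented for illustration.

```python
def shared_attributes(flowcharts):
    """S: attributes imported by some flowchart and exported by some flowchart.

    `flowcharts` maps a flowchart name to a pair (imports, exports) of sets.
    """
    all_in = set().union(*(imports for imports, _ in flowcharts.values()))
    all_out = set().union(*(exports for _, exports in flowcharts.values()))
    return all_in & all_out


def check_cluster(flowcharts, cluster_imports, cluster_exports):
    """Check I and O against the unions of in(f) and out(f) over the cluster."""
    all_in = set().union(*(imports for imports, _ in flowcharts.values()))
    all_out = set().union(*(exports for _, exports in flowcharts.values()))
    return cluster_imports <= all_in and cluster_exports <= all_out


# Hypothetical cluster of three flowcharts; the attribute sets are invented.
cluster = {
    "f3": ({"E1", "A1"}, {"A3"}),
    "f4": ({"E2", "A2"}, {"A4"}),
    "f5": ({"A3", "A4"}, {"A5"}),
}
```

For this invented cluster, `shared_attributes(cluster)` yields the attributes that never need to leave the processing node, which is what makes grouping flowcharts into a cluster attractive.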

Example 2. Fig. 2(b) shows a single CFC that includes three flowcharts, where the 
import/export/shared attributes are listed at the top. 

Definition. Let $\mathcal{F}$ be a set of CFCs and $T_g$ a set of attributes (called the target 
attributes of the CFCs). $\mathcal{F}$ is said to be well-formed w.r.t. $T_g$ if (a) for each CFC 
$F \in \mathcal{F}$ and each import attribute $A$ in $F$, there is a CFC $F' \in \mathcal{F}$ which exports $A$, 
and (b) each attribute of $T_g$ is exported by some CFC in $\mathcal{F}$. 

For each well-formed set $\mathcal{F}$ of CFCs, we define a dependency graph $G_{\mathcal{F}}$ which 
characterizes the dependencies of control and data flows within flowcharts within 
$\mathcal{F}$. Specifically, $G_{\mathcal{F}} = (V, E)$ where $V$ is the union of all flowchart nodes in CFCs 
of $\mathcal{F}$, and $E$ contains all flowchart edges in CFCs of $\mathcal{F}$ and all edges $(u, v)$ such that 
$u$ defines an attribute that is used in $v$. A well-formed set $\mathcal{F}$ of CFCs is said to be 
acyclic if $G_{\mathcal{F}}$ has no cycles. The set of two CFCs in Fig. 2 is acyclic. 
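
Conditions (a) and (b) of well-formedness reduce to two subset tests. The following sketch assumes each CFC is summarized by its import and export sets; the function name and the two-cluster plan are hypothetical and only loosely modeled on Fig. 2.

```python
def is_well_formed(cfcs, targets):
    """Check conditions (a) and (b) of well-formedness.

    `cfcs` maps a cluster name to a pair (imports, exports) of attribute sets.
    """
    exported = set().union(*(exports for _, exports in cfcs.values()))
    imported = set().union(*(imports for imports, _ in cfcs.values()))
    # (a) every imported attribute is exported by some CFC in the set
    # (b) every target attribute is exported by some CFC in the set
    return imported <= exported and set(targets) <= exported


# Hypothetical two-cluster plan; the attribute names are illustrative only.
plan = {
    "cluster_a": (set(), {"A1", "A2", "E1", "E2"}),
    "cluster_b": ({"A1", "A2", "E1", "E2"}, {"A3", "A5"}),
}
```

The check says nothing about acyclicity, which needs the node-level dependency graph rather than the cluster-level import and export sets.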

Finally we discuss some key issues in developing implementation models for 
CFCs. Clearly, the control structures of flowcharts are very simple and generally 
available in a variety of scripting languages; even a direct implementation (of the control 
structures) is straightforward. There are, however, three issues that need 
further discussion. The first issue is the mapping from flowcharts to processors. The 
conceptual model of CFCs assumes that each processor may be capable of executing 
multiple flowcharts concurrently. Thus the mapping is simply to assign each CFC 
to a processor. The second issue is about how the flowcharts in a set of CFCs are 
invoked. This can be done by having a single entry to the CFC, that will spawn all 
flowcharts in the CFC (e.g., through remote procedure calls). 

The third issue is related to communication, i.e., passing attribute values be- 
tween tasks. There are three different ways an attribute value can be sent around. 
(1) The attribute is defined and used in the same flowchart. If the attribute is ac- 
cessed through a read operation, we just need to provide some storage space local 
to a flowchart for storing the attribute value. For acyclic flowcharts, since each get 
operation of an attribute is required to follow the task defining the attribute, get and 
read operations are effectively the same. (2) The attribute is defined and used in dif- 
ferent flowcharts in the same CFC. In this case, we need to provide a common buffer 
(write exclusive) for all flowcharts (processes) involved. While read operations do 
not require extra support, get operations would require a list to be maintained for 
each attribute that has not received a proper value. (3) In the most general case, 
the attribute is defined in one CFC and used in another. There are many ways to 
implement read and get. For example, both “producer” and “consumer” processors 
maintain a buffer for holding the attribute value. Passing the value from the producer 
buffer to the consumer buffer can be done by push or pull. Similar to case (2), get 
operations still rely on maintaining a list of requests. One possible architecture for 
handling communications is through a data repository. 

4 Optimization Issues in Choosing Execution Plans 

In this section we present a physical model for workflow distribution that extends 
the logical model of the previous section. We identify key cost factors in the physical 
model, and illustrate how they affect the cost of different physical execution plans. 
We then develop a very preliminary cost model, by presenting formulas that can 
be used to give bounds on the costs of physical execution plans. The measures 
we consider are: total running time for processing a workflow instance, maximum 
throughput, i.e., maximum number of instances processed in a time unit, and total 
network load resulting from transferring data between CFCs. 

4.1 A Physical Model for Executing Communicating Flowcharts 

We assume that each CFC is executed on a separate processing node. The exchange 
of attribute values between the CFCs can use push or pull approaches. The execution 
of tasks in a CF may involve accessing data at a data repository on either the local 
or a remote processing node. To capture the costs for transferring attribute values 
between a task manager and the task execution in a uniform way, we conceptually 
replace the task node in a CFC by three nodes: an entry node, a node for data re- 
trieval at the task execution machine, and an exit node. The in-edges of the original 
node are now in-edges of the entry node, while the out-edges of the original node 
are now out-edges of the exit node. There is an edge from the entry node to the data 
retrieval node and one from there to the exit node. If the data repository is local, the 
data retrieval node remains in the same CFC. Otherwise, this node will reside in a 
separate CFC. 
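
The three-node replacement described above is a simple graph rewrite. A minimal sketch follows, assuming edges are stored as pairs of node names; the function name and the suffixes used for the new nodes are hypothetical.

```python
def split_task_node(edges, task):
    """Replace `task` by entry -> retrieval -> exit, rerouting its edges."""
    entry, retrieval, exit_ = f"{task}.entry", f"{task}.retrieve", f"{task}.exit"
    new_edges = set()
    for u, v in edges:
        # Out-edges of the original node now leave the exit node,
        # and in-edges of the original node now enter the entry node.
        new_u = exit_ if u == task else u
        new_v = entry if v == task else v
        new_edges.add((new_u, new_v))
    new_edges |= {(entry, retrieval), (retrieval, exit_)}
    return new_edges
```

Whether the retrieval node stays in the same CFC or moves to another one is a separate placement decision that this sketch does not model.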

When a new workflow instance is created, we assume for simplicity that all the CFs 
are invoked for this instance at the same time. If an execution step in a CF depends 
on attribute(s) which have not been defined, the CF waits for the definition of the 
attribute(s), i.e., until a true value or null value ω is present. 

Processing in a flowchart for a given workflow instance terminates when a leaf 
node of the flowchart completes. 

4.2 Comparing Alternative CFCs 

We give two examples here to illustrate the differences in performance and cost that 
alternative execution plans for the same workflow schema may have. We first show 
the difference in terms of response time and network load. Recall the example in Fig. 
2 (a) and (b). Suppose that computing attribute A3 (in CFC(b)) requires retrieving 
data from a different processing node where CFC(a) resides. A data retrieval node 
for A3 is then created in CFC(a) with two edges crossing the CFC boundary leading 
to the CFC(b) cluster. If we move the CF for computing A3 to CFC(a), these two 
edges will be eliminated, which most likely reduces network load and response time. 
This effect will be shown in our formulas presented in the next section. 

The second example concerns Fig. 3. We show that we can transform the CFCs 
to eliminate unneeded evaluation of attributes during the execution of certain workflow 
instances. In Fig. 3(a), the node A5 = ω is moved much “closer” to the root 
node of the CF. In the cases when A5 is evaluated to ω, we can avoid an unneeded 
execution of task A4. 

4.3 Cost Factors 

The logical model of CFCs permits the management of tasks and actual execution 
of tasks to be on two distinct clusters. For ease of reasoning, we assume that task 
management nodes and task execution nodes are explicitly specified in a flowchart 
plan f. The notion of clusters is thus generalized, since all nodes (management or 
execution) need to be on some processing node. In the rest of this section, we do not 
distinguish between the two types of management and execution nodes, and refer to 
them simply as flowchart nodes. 

We first define some notation. Let $F_A$ represent the CFC in which an attribute $A$
is computed. Let $|A|$ denote the expected size of $A$ in number of bytes (this includes
cases where the transmitted value of $A$ is $\omega$ or $\bot$, which is viewed as having a size
of one). Let $P_u(F, A)$ denote the probability of attribute $A$ being used in the CFC $F$.
$P_d(A)$ denotes the probability of attribute $A$ being computed. We also use $P_d(X)$ to
refer to the probability of any flowchart node $X$ being executed, rather than simply
the probability of an attribute being computed. For example, even condition nodes
have probabilities associated with them, which reflect the probability of the condition
node being evaluated (not the value of the condition being true or false). The actual
meaning of $X$ in $P_d(X)$ should be evident from the context.
We now derive bounds on response time and achievable throughput. These
bounds depend on the cost associated with executing a workflow instance, which can 
primarily be broken down into two parts: 1) processing cost of executing flowchart 
nodes within clusters, and 2) communication cost of passing data between clusters. 

Let $Work(A)$ denote the expected number of units of work required to compute
node $A$ for one workflow instance. Let $Nodes(F)$ denote the set of nodes in cluster
$F$. $Work(F)$ denotes the total processing cost of cluster $F$ for one workflow
instance, and is given by:

$Work(F) = \sum_{A \in Nodes(F)} P_d(A) \cdot Work(A)$
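
As a minimal sketch of this sum, the following Python fragment computes the expected processing cost of one cluster. The dictionary layout, the node names, and the numbers are assumptions made for illustration only; they do not come from the paper.

```python
def cluster_work(nodes):
    """Expected processing cost of one cluster for one workflow instance.

    `nodes` maps a node name to a pair (p_exec, work): the probability that
    the node is executed and the units of work it needs when it runs.
    """
    return sum(p_exec * work for p_exec, work in nodes.values())


# Hypothetical cluster with three flowchart nodes.
cluster_a = {
    "A1": (1.0, 4.0),   # always executed
    "A2": (0.5, 10.0),  # executed for half of the instances
    "A3": (0.25, 8.0),  # executed for a quarter of the instances
}

total_work = cluster_work(cluster_a)
```

A node that is rarely executed contributes little to the total even if it is expensive when it does run.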

Communication cost depends on the amount of data transferred between clus- 
ters, the effective network bandwidth on the communication link between the clus- 
ters, and the method of communication between the clusters. We consider two pos- 
sible methods of communication, namely, push and pull. 

In the push mode, the value of an output attribute A is pushed to all the clusters 
$F$ which may use that value, i.e., where $A \in in(F)$. This can result in a reduction of
response time by hiding communication latency, since the target attribute does not 
have to wait for a source attribute value to be delivered when required. Moreover, 
efficient multicast algorithms are available for distributing a value from a single 
source to multiple targets. However, one drawback of the push model is that it results 
in increased network load, since the source value is distributed to clusters which may 
not use the value in the actual execution. 

The pull mode defers communication until it is necessary. When a flowchart
node is being computed that requires attribute values from other clusters,
communication is initiated to retrieve those values at that point. The advantage of
this model is reduced network load; however, the reduction in communication latency
offered by the push mode is not achieved. We only consider the pull mode in deriving
expressions for communication costs. The costs for the push mode can be derived similarly.

We assume we are given a function $Comm_{F_i,F_j}(d)$ that denotes the communication
cost of transferring $d$ bytes of data from cluster $F_i$ to $F_j$. An example
of this function would be $Comm_{F_i,F_j}(d) = ts_{F_i,F_j} + tb_{F_i,F_j} \cdot d$, where $ts_{F_i,F_j}$
(resp. $tb_{F_i,F_j}$) denotes the startup time (resp. byte transfer rate) for the
communication between clusters $F_i$ and $F_j$. However, any appropriate cost model can be
chosen instead of the above. The total communication cost $Comm$, summed over all
clusters $F$, is then given by:

$Comm = \sum_{F} \sum_{A \in in(F)} P_u(F, A) \cdot Comm_{F_A,F}(|A|)$

We use the term network load to refer to the total amount of data transferred
between the clusters during the execution of a workflow instance. As we explain later,
network load is an important concept used for determining the maximum throughput.
The total network load $NL$ is given by:

$NL = \sum_{F} \sum_{A \in in(F)} P_u(F, A) \cdot |A|$
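
The two pull-mode quantities can be computed together. The sketch below assumes the linear cost model given above as an example (startup time plus a per-byte term); the data layout, cluster names, and numbers are hypothetical.

```python
def linear_comm_cost(startup, per_byte):
    """Return a cost function d -> startup + per_byte * d for one link."""
    return lambda d: startup + per_byte * d


def comm_and_load(imports, sizes, home, link_cost):
    """Expected communication cost and network load in pull mode.

    imports:   {cluster: {attribute: probability that the cluster uses it}}
    sizes:     {attribute: expected size in bytes}
    home:      {attribute: cluster in which the attribute is computed}
    link_cost: {(source cluster, target cluster): cost function of bytes}
    """
    comm = 0.0
    load = 0.0
    for cluster, used in imports.items():
        for attr, p_use in used.items():
            cost_fn = link_cost[(home[attr], cluster)]
            comm += p_use * cost_fn(sizes[attr])
            load += p_use * sizes[attr]
    return comm, load


# Hypothetical two-cluster plan: cluster "b" imports A1 and A2 from "a".
imports = {"a": {}, "b": {"A1": 1.0, "A2": 0.5}}
sizes = {"A1": 100, "A2": 400}
home = {"A1": "a", "A2": "a"}
link_cost = {("a", "b"): linear_comm_cost(startup=2.0, per_byte=0.125)}

comm, load = comm_and_load(imports, sizes, home, link_cost)
```

Moving a flowchart into the cluster that produces its inputs removes entries from `imports`, which lowers both values.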

Response Time: We now present some worst case bounds on average instance
response time for a particular flowchart plan f . We assume that the time for comput- 
ing attributes, and communicating attribute values between clusters, dominates the 
execution time. We assume that the source attributes for f are available at time = 0. 
We compute the expected response time of f by recursively computing the finish 
times of each flowchart node, by starting from the source nodes and proceeding in a 
topological sort order until the target nodes are reached. 
Let $T_A$ denote the finish time of the node $A$. Let $pred(A)$ denote the predecessor
set of node $A$ in $F_A$, and $input(A)$ denote the set of input attributes for node $A$. Let
$Tenabled_A$ denote the expected time instant when it is the turn of $A$ to execute
(note that the inputs of $A$ might not be ready at this time). Then,

$Tenabled_A = \sum_{B \in pred(A)} (P_d(B) \cdot T_B)$

Now, let $Tready_A$ denote the time when the inputs to $A$ are also ready. The set of
inputs to $A$ can be broken down into two categories: 1) those that are computed in
the same CFC, and 2) those computed in other CFCs. Note that the acyclicity property
of the CFCs implies that the first set of inputs will always be computed by time
$Tenabled_A$. For the second set of inputs, assuming actual read, the execution of $A$
is delayed until those inputs are pulled from the corresponding CFCs. Hence,

$Tready_A = \max[Tenabled_A, \max_{B \in (input(A) \cap in(F_A))} (T_B + Comm_{F_B,F_A}(|B|))]$

Note: for immediate read, the term $T_B$ in this expression would not be present. If
$Time(A)$ represents the expected processing time of $A$, we get $T_A = Tready_A + Time(A)$.
The expected response time for the entire workflow is then given by
$\max_{A \in Tg}(T_A)$. The above analysis gives an approximation of the response time.
To be more accurate, we should incorporate the value for the expected load on the
processing nodes, since that impacts the processing time of each node.
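
The recursive computation of finish times can be sketched as follows. This is an illustration under stated assumptions: the enabling time is taken to be the probability-weighted sum of the predecessors' finish times, remote inputs use the actual-read rule, and the three-node plan with its costs is made up.

```python
def finish_times(order, pred, p_exec, proc_time, remote_inputs):
    """Expected finish time of every flowchart node.

    order:         nodes in topological order (sources first)
    pred:          {node: predecessors in the same cluster}
    p_exec:        {node: probability that the node is executed}
    proc_time:     {node: expected processing time}
    remote_inputs: {node: [(producing node, transfer cost)]} for inputs
                   that are pulled from other clusters
    """
    finish = {}
    for a in order:
        # Expected instant at which it is a's turn to execute.
        enabled = sum(p_exec[b] * finish[b] for b in pred.get(a, []))
        # A remote input is pulled only after its producer has finished.
        pulled = [finish[b] + cost for b, cost in remote_inputs.get(a, [])]
        ready = max([enabled] + pulled)
        finish[a] = ready + proc_time[a]
    return finish


# Hypothetical plan: A1 -> A2 in one cluster, A3 in another cluster reads A2.
order = ["A1", "A2", "A3"]
pred = {"A2": ["A1"]}
p_exec = {"A1": 1.0, "A2": 1.0, "A3": 1.0}
proc_time = {"A1": 2.0, "A2": 3.0, "A3": 1.0}
remote_inputs = {"A3": [("A2", 1.5)]}

finish = finish_times(order, pred, p_exec, proc_time, remote_inputs)
response_time = max(finish[a] for a in ["A3"])  # A3 is the only target node
```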



Throughput: The maximum achievable throughput is bounded by two factors: 1) 
processing power of the processing nodes, and 2) network bandwidth between the 
processing nodes. 

Let $Cap_F$ denote the processing capacity (in terms of number of processing
units per unit time) of the processing node that executes cluster $F$. Assuming infinite
network bandwidth, the maximum throughput achievable for a given processing
node is given by the processing capacity of that node divided by the amount of work
to be performed by the corresponding cluster. We denote this by $Throughput_F$:

$Throughput_F = \frac{Cap_F}{Work(F)}$

Similarly, we assume that the bandwidth of the link between two processing nodes
(clusters) $F_i$ and $F_j$ is $Bandwidth_{F_i,F_j}$. The set of attributes whose value may be
transferred from $F_i$ to $F_j$ is $(I_{F_j} \cap O_{F_i})$, and those from $F_j$ to $F_i$ is
$(I_{F_i} \cap O_{F_j})$. The maximum throughput achievable due to this link (denoted by
$Throughput_{F_i,F_j}$) is given by the link bandwidth divided by the network load
generated due to traffic on this link. Assuming a pull mode, we get

$Throughput_{F_i,F_j} = \frac{Bandwidth_{F_i,F_j}}{\sum_{A \in (I_{F_j} \cap O_{F_i})} P_u(F_j, A) \cdot |A| + \sum_{A \in (I_{F_i} \cap O_{F_j})} P_u(F_i, A) \cdot |A|}$

Hence, given a flowchart plan, the maximum achievable throughput is given by:

$Throughput = \min(\min_{F} Throughput_F, \min_{F_i,F_j} Throughput_{F_i,F_j})$

Note that the above analysis presents upper bounds on the throughput. Actual 
throughput would vary depending on the expected network load. 
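
The bound is the minimum over all processing nodes and all links, as the following sketch shows. The function name, the dictionary layout, and the numbers are assumptions for illustration; links that carry no traffic are skipped because they impose no bound.

```python
def max_throughput(capacity, work, bandwidth, link_load):
    """Upper bound on throughput in workflow instances per unit time.

    capacity:  {cluster: processing units per unit time}
    work:      {cluster: expected work per workflow instance}
    bandwidth: {(cluster_i, cluster_j): bytes per unit time on the link}
    link_load: {(cluster_i, cluster_j): expected bytes per instance on the
                link, counting both directions}
    """
    node_bounds = [capacity[f] / work[f] for f in capacity]
    link_bounds = [
        bandwidth[link] / link_load[link]
        for link in bandwidth
        if link_load[link] > 0
    ]
    return min(node_bounds + link_bounds)


# Hypothetical plan with two clusters and one link between them.
capacity = {"a": 100.0, "b": 60.0}
work = {"a": 10.0, "b": 4.0}
bandwidth = {("a", "b"): 2400.0}
link_load = {("a", "b"): 300.0}

bound = max_throughput(capacity, work, bandwidth, link_load)
```

In this example the link (8 instances per unit time) is the bottleneck, not the processing nodes (10 and 15).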



5 Representative Transformations and Rewriting Rules 

We now present some representative transformations and rewriting rules on logical 
execution plans. The first rule covers moving flowcharts between clusters, and the 
others focus on manipulating one or two CFs within a cluster. The application of 
these transformations should be guided by optimization goals and heuristics. These 
transformations are intended to be illustrative, but not necessarily complete. 
The transformation rules take as input a logical execution plan, i.e., a family of
CFCs, and produce as output another logical execution plan. To get started, given 
a workflow schema, we assume that an atomic CF is created for each task in that 
schema, and that a separate cluster is created for each CF. (An alternative would be 
to put all of the atomic CFs into a single cluster.) 

For this section we assume that all condition nodes have two outgoing edges, 
one for true and one for false, and that all tasks eventually finish (or are determined 
to be “dead”). These assumptions ensure that all possible executions of a flowchart 
will eventually reach a terminal node. 

5.1 Moving Flowcharts between Clusters 

We can move flowcharts from one cluster to another, using the move transformation. 

Move: This transformation moves a flowchart f from one CFC F1 to another
CFC F2. In addition to removing f from F1 and adding f to F2, we also modify
the sets of import, export and shared attributes of the two clusters to reflect the
move: we remove those import attributes that are only used by f from the set of
import attributes of F1, and remove the attributes defined by f from the set of export
attributes of F1. Appropriate actions are also needed for F2.

For example, consider Fig. 2. If we move the flowchart for task A 4 from the 
CFC of (b) to that of (a), we need to (i) remove and A2 from the imports of (b), 
(ii) remove A4 from the shared attributes of (b), (iii) add A 4 to the imports of (b), 
(iv) add A4 to the exports of (a), (v) add and A2 to the shared attributes of (a), 
and (vi) remove and A2 from the exports of (a). 

5.2 Combining and Decomposing Flowcharts 

There is one rule for combining flowcharts, and one for decomposing them. 

Append: This transformation appends a flowchart f2 to another f1 to produce a
third f, provided that f1 does not use any attribute defined in f2: the node set and
edge set of f are the union of the respective sets of f1 and f2, and moreover f has the
edge (u,v), for each terminal node u of f1 and the entry node v of f2.
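
The Append rule can be sketched on a small graph representation. The `Flowchart` class, its field names, and the node names are hypothetical; the disjointness check on node sets is an extra assumption of this sketch, not a condition stated in the rule.

```python
from dataclasses import dataclass, field


@dataclass
class Flowchart:
    nodes: set
    edges: set          # set of (source, target) pairs
    entry: str
    terminals: set      # nodes without outgoing edges
    uses: set = field(default_factory=set)     # attributes read
    defines: set = field(default_factory=set)  # attributes written


def append(f1, f2):
    """Append f2 to f1; f1 must not read any attribute that f2 defines."""
    if f1.uses & f2.defines:
        raise ValueError("f1 depends on an attribute defined in f2")
    if f1.nodes & f2.nodes:
        raise ValueError("node sets must be disjoint")  # assumption of this sketch
    # One new edge from every terminal node of f1 to the entry node of f2.
    bridge = {(u, f2.entry) for u in f1.terminals}
    return Flowchart(
        nodes=f1.nodes | f2.nodes,
        edges=f1.edges | f2.edges | bridge,
        entry=f1.entry,
        terminals=set(f2.terminals),
        uses=f1.uses | f2.uses,
        defines=f1.defines | f2.defines,
    )


# Hypothetical flowcharts: f1 computes A3, f2 reads A3 and computes A5.
f1 = Flowchart({"c1", "t3", "skip3"}, {("c1", "t3"), ("c1", "skip3")},
               "c1", {"t3", "skip3"}, defines={"A3"})
f2 = Flowchart({"c2", "t5"}, {("c2", "t5")}, "c2", {"t5"},
               uses={"A3"}, defines={"A5"})
f = append(f1, f2)
```

Calling `append(f2, f1)` raises `ValueError`, because the first flowchart would then read an attribute that is only defined by the appended one.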

For example, consider Fig. 2. If we append the flowchart f2 for task A5 to the
flowchart f1 for task A3, then we add an edge from E4 = Done(A3)
to the root of f2. Observe that we cannot append these two flowcharts in the other
order because of the data flow dependency between them.

By combining the Append rule with the Reorder rules (given below) it is pos- 
sible to achieve other ways of combining flowcharts. As a simple example, given 
two linear flowcharts with no data flow dependencies between them, one can form 
a flowchart that interleaves the tasks of the two flowcharts in an arbitrary manner.

Split: This is the inverse of append; it splits one flowchart f into two flowcharts,
f1 and f2. Splitting of f can be done at any node v which is a node cut of f (viewed
as a graph): We invent a new (event) attribute $E_v$. f1 consists of everything of f
except the branch of f starting from v; furthermore, the edge leading to v now points
to a new node with the assignment task $E_v$ = True. f2 consists of the branch of
f starting from v, plus a new condition node “$E_v$ = True” whose outgoing True
edge goes to v, and outgoing False edge goes to a no-op terminal node. (We can
avoid the use of $E_v$ if the branch from v does not depend on the rest of f.)
Fig. 4. Flowcharts arising at intermediate points in transformation from Fig. 3(a) to Fig. 3(b)



Split can also be performed using larger node cuts, although it is more compli- 
cated. First, we need to create several exit events, one for each element of the node 
cut. Second, we have to construct a series of condition nodes that essentially perform
a case statement that checks which event is true in order to start processing at the 
correct part of the split-off flowchart. 

5.3 Modifying a Flowchart 

We now present several representative rules for modifying individual flowcharts. 
These are essentially equivalent to transformations that are used for code rewriting 
in compiler optimization. We shall illustrate some of the transformations using
Fig. 4, which shows two flowcharts that arise when transforming the flowchart of
Fig. 3(a) into the flowchart of Fig. 3(b).

A flowchart is said to be single-entry-single-exit if it has exactly one entry 
and one exit node. Observe that a singleton node is a single-entry-single-exit sub- 
flowchart. For the sake of simplifying the discussion, we assume without loss of 
generality that the single exit node is a no-op terminal node, by appending one such 
node if necessary. Observe that this no-op terminal node is always absorbed by the 
subsequent entry node in the examples. 

Reorder: This transformation changes the ordering of two single-entry-single-exit
sub-flowcharts f1 and f2 when there is no data dependency between them.
More specifically, let f1 and f2 be two single-entry-single-exit sub-flowcharts of a
flowchart such that (i) the exit node of f1 goes to the entry node of f2, (ii) there
are no other edges leaving f1, and (iii) there are no other edges entering f2. This
transformation will then exchange the ordering of f1 and f2.

For example, consider Fig. 3(a). Let the four nodes headed by E2 ∧ P2(A1) be f1,
and the three nodes headed by ∧ P3(A2) plus a no-op terminal node be f2. Then
the ordering of f1 followed by f2 can be changed to f2 followed by f1.

Reorder through condition: In some cases, it is useful to push a
single-entry-single-exit sub-flowchart through a condition node. In the case of pushing a
sub-flowchart downwards through a condition node, the sub-flowchart will be
duplicated. Consider the node labeled P1(A1, A3). Let f1 be the sub-flowchart of three
nodes above that condition, along with a no-op terminal node. Fig. 4(b) shows the
result of pushing f1 downwards through the condition node. This transformation can
also be applied in the opposite direction.

Condition splitting: Suppose there is a condition node with label C1 ∧ C2. This
can be split into two condition nodes, one for C1 and one for C2, in the natural
manner. A similar transformation can be applied to conditions of the form C1 ∨ C2. For
example, Fig. 4(a) can be obtained from Fig. 3(a) by the following steps. First, split
the condition ∧ P1(A1, A3), and then use reordering to push upwards.

Duplicate: Suppose there is a single-entry-single-exit sub-flowchart f1 whose
root has two (or more) in-edges. The duplicate rule permits creating a duplicate
copy of f1, putting one copy below the first in-edge and the other copy below the
second in-edge. If the tail of the original f1 was not a terminal node, then the tails of
the copies can be brought together and connected where the original tail was.

Delegate: Intuitively, this transformation works as if it is delegating the task of
a node to a different processor. This can be used to increase parallelism. For this to
work correctly, we must ensure that the dependencies are taken care of properly.
Formally, we replace the old task node with a chain of two nodes: the first one spawns
the action of the old task node (e.g., with keyword spawn(A)), and the second one
is the action of waiting for that task to finish (e.g., by get(A)). Reordering can then
be used to push the get(A) node downwards in the flowchart.
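
The spawn/get pair behaves like submitting a task to a worker and later waiting on its result. The following sketch uses Python's standard `concurrent.futures` module as a stand-in for a second processor; `task_a` and `independent_step` are hypothetical placeholders, and the mapping to the two keywords is only an analogy.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a():
    """Stand-in for the delegated task A."""
    return sum(range(1000))


def independent_step():
    """Stand-in for a node that does not depend on A."""
    return "done"


with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(task_a)   # plays the role of spawn(A)
    other = independent_step()      # runs while A is still in progress
    a_value = pending.result()      # plays the role of get(A)
```

The further the waiting step is pushed down, the more independent work can overlap with the delegated task.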

Remove Unneeded: If a workflow schema has specified target and non-target
tasks, then in some cases unneeded tasks can be detected and eliminated from
flowcharts. For example, in Fig. 4(b) the two right-most tasks assigning A4 are not
needed anywhere below. As a result, these tasks, and the condition node above, can
be deleted. This shows how Fig. 4(b) can be transformed into Fig. 3(b).

Other transformations (e.g. the unduplicate transformation) are also possible. 

6 Conclusions 

Our paper has focused on distributing workflow tasks among a set of processing 
nodes and on optimizing execution of the tasks. We developed a general framework 
for representing logical execution plans using communicating flowchart clusters and 
an initial set of transformations on logical execution plans. With intuitive examples
from a simple workflow schema that can be defined in many workflow systems,
we illustrated some heuristics for applying the rewrite rules in optimizing execution
plans. We also presented the main components of a model of physical execution
plans, along with a preliminary analysis of key cost factors.

The technical results reported in this paper are preliminary; there are many in- 
teresting questions that deserve further investigation, ranging from theoretical foun- 
dation to practical application. In one extreme, there are many decision problems 
arising from this, such as: Is there a (sound and) complete set of rewrite rules? What 
is the complexity of testing if a set of CFCs have deterministic behavior? In the 
other extreme, it is fundamental to investigate further the issue of cost models, and 
heuristics for identifying efficient physical execution plans. It is also unclear how
this kind of compile-time optimization technique compares with the ones based on
adaptive scheduling such as [HLS+99b, HKL+99].
Abstract. Traditional database management requires design and ensures
declarativity. In the context of semistructured data a more flexible
approach is appropriate due to missing schema information. In this pa- 
per we present a query language based on schema matching. Intuitively, 
a query is a pair consisting of what we want and how we want it. We 
propose that the former can be achieved by matching a (partial) schema 
and the latter by specifying additional operations. We describe in some 
detail our notion of schema covering various concepts typically found 
in query languages, such as predicates, variables and paths. We outline 
the optimization potential that this modular approach offers and discuss 
how we use constraints for query processing. 



1 Introduction 

Traditional database management requires design and ensures declarativity. Semi- 
structured data, “data that is neither raw data nor strictly typed” , lacks a fixed 
and rigid schema [Abi97]. Often their structure is irregular and implicit. Examples
of semistructured data include HTML files, BibTeX files or genome data
stored in ASCII files. Recently, XML has emerged as a common syntactical rep- 
resentation for semistructured data. To us, the key problem of semistructured 
data is to move from content-based querying, i.e., the UNIX grep-command or 
simple WWW search engines, to structure-based querying, i.e., querying in an 
SQL-like manner. 

Before we present the idea of the paper we take a look at relational systems. 
We identify three layers: the operational layer, the schema layer and the instance 
layer. The tuples form the instance layer; and the tables form the schema layer. 
On the operational layer we find queries, views, constraints etc. The items of 
the operational layer are expressed using the items of the schema layer, i.e., 
queries are expressed using the tables. We would like to adapt this framework 
of querying for semistructured data. As we have learned, the serious problem of
semistructured data is its lack of complete, known-in-advance structure. In this 
query framework for relational data every item on the instance layer (i.e., every 
tuple) belongs to exactly one item on the schema layer (i.e., to exactly one table). 
Certainly, this requirement has to be relaxed in the context of semistructured 
data. 



[Figure: three layers labeled Operations, Schemata and Instances, with the query
kinds Schema query, Focus query and Transformation query]

Fig. 1. A new query framework for semistructured data




The query framework, adapted to cover semistructured data, is shown in 
Figure 1. Because semistructured data is usually represented as a graph, we show 
example graphs for the two bottom layers. Partial schemata in the middle layer 
conform to some parts of the database in the bottom layer. There is no further 
restriction, i.e., a partial schema can have an arbitrary number of instances in 
the database, and instances can conform to an arbitrary number of schemata. 

The crucial layer of this approach is the middle one, the layer of the schemata. 
There are a number of interesting questions: How do we get those schemata? 
What are they good for? How do we manage them? The simplest way to get them 
is from a database designer. Remember that the data is called semistructured 
rather than unstructured. So at least some parts of a database can potentially 
be modeled. A database designer may thus be able to provide some meaningful 
partial schemata. Another way to get partial schemata is the following. A query 
posed to the system uses both schema and operational layer. In other words, 
a query consists of a “What”-part (i.e., a partial schema) and a “How”-part
(i.e., an operation). As an analogy, in the relational world we can consider a
selection to correspond to the “What”-part and a projection to correspond to the
“How”-part of a query. Now, an obvious approach is to cache the “What”s, i.e.,
to extract partial schemata out of queries. To make this possible we lift some 
concepts typically found in queries (such as selection conditions) to the layer of 
the schemata. Partial schemata are useful for two main purposes. First, they can 
give users hints on the content of a database. Second, they can be used for query 
optimization. Note that schemata that are good for the former are not necessarily
good for the latter and vice versa.

What are the advantages of our approach? A system designed in this way
reflects the degree of structure of a database on many levels. If a database is
well structured there will be large schemata with many instances. Thus, users
will get a lot of information about the data; and the performance of the system
will be good as well. If, however, the database is not well structured there will
be only some useful schemata. Thus, the user will not get full knowledge about
the database; and the performance will suffer as well. The schema layer can
serve as an indication of the degree of structure of the database. The existence
of large schemata with many instances indicates that the database is rather
well structured. Parts of the database that are not covered by any schema
are probably not very interesting or have a rather obscure structure. We will
conclude this section with three paradigms that shall guide our approach:

1. Answering a query works without schema information. 

2. Answering a query benefits from schema information. 

3. Answering a query induces new schema information.

This paper is organized as follows. Section 2 presents the syntactical data
representation we are using. In Section 3 we define our semantically rich notion 
of schema that forms the base for the queries that are introduced in Section 4. 
The second part of the paper deals with query processing based on constraints. 
This is outlined in Section 5. We conclude with related work and a discussion. 



2 Labeled graphs as data representation 

In this section we describe the underlying data model of our proposal. It is a
graph-based approach because graph models seem to be “the unifying idea in
semi-structured data” [Bun97]. We try to be very general and do not impose
any specific restrictions on our graphs.

Definition 1 (Total directed graph). A tuple $G = (V, A, s, t)$ is a total
directed graph if $V$ is a set of vertices, $A$ a set of arcs and, furthermore, $s$
and $t$ are total functions from $A$ to $V$ assigning each arc its source and target
vertex, respectively.

We also use the term node instead of vertex. However, we use the term arc
instead of edge to emphasize that we consider directed graphs. In our model two
nodes can be linked by more than one arc. Cycles are allowed. The following 
definition introduces labels on vertices and arcs. 

Definition 2 (Labeled directed graph). Let $\mathcal{L}$ be an arbitrary set of labels.
A tuple $G = (V, A, s, t, l)$ is a ($\mathcal{L}$-)labeled directed graph if $(V, A, s, t)$ is a total
directed graph and $l : V \cup A \to \mathcal{L}$ is a total label function assigning each vertex
and arc a label from $\mathcal{L}$.
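
The two definitions translate directly into a small data structure. The sketch below is one possible encoding, with the class name, field names, and the sample fragment chosen for illustration; vertex and arc identifiers are assumed to be disjoint so that a single label dictionary can serve both.

```python
from dataclasses import dataclass


@dataclass
class LabeledGraph:
    vertices: set
    arcs: set
    source: dict  # arc -> vertex
    target: dict  # arc -> vertex
    label: dict   # vertex or arc -> label

    def is_well_formed(self):
        """Check that s, t and l are total functions with the right ranges."""
        return (
            set(self.source) == self.arcs
            and set(self.target) == self.arcs
            and set(self.source.values()) <= self.vertices
            and set(self.target.values()) <= self.vertices
            and set(self.label) == self.vertices | self.arcs
        )


# Hypothetical fragment with two parallel arcs between v0 and v1.
g = LabeledGraph(
    vertices={"v0", "v1", "v2"},
    arcs={"a1", "a2", "a3"},
    source={"a1": "v0", "a2": "v0", "a3": "v1"},
    target={"a1": "v1", "a2": "v1", "a3": "v2"},
    label={"v0": "Root", "v1": "#1", "v2": "Carpenter",
           "a1": "person", "a2": "friend", "a3": "surname"},
)
```

Because arcs carry their own identifiers, two nodes can be linked by more than one arc, and cycles need no special treatment.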
Now, an object is a labeled directed graph. We also use the term database
instead of object when we talk about a “large” object that is to be queried.
Note that we usually denote objects with lower-case letters (i.e., $o_1, o_2, \ldots$),
but graphs with upper-case letters (i.e., $G_1, G_2, H, \ldots$) in order to be consistent
with both worlds.

Figure 2 presents an example that we shall use throughout the paper. It 
shows a semistructured database on persons having names, surnames, a year 
of birth, a profession etc. Additionally, a sibling relationship relates different 
people. 




Fig. 2. A labeled directed graph



For specifying answers to queries we will need the notion of a subobject of a 
database. This assumes some basic knowledge of partial orders, for an introduc- 
tion see [Tro92]. 

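Definition 3 (Subobject). An object $o_2 = (V^{(o_2)}, A^{(o_2)}, s^{(o_2)}, t^{(o_2)}, l^{(o_2)})$ is a
subobject of $o_1 = (V^{(o_1)}, A^{(o_1)}, s^{(o_1)}, t^{(o_1)}, l^{(o_1)})$ if $V^{(o_2)} \subseteq V^{(o_1)}$, $A^{(o_2)} \subseteq A^{(o_1)}$,
$s^{(o_2)} = s^{(o_1)}|_{A^{(o_2)}}$, $t^{(o_2)} = t^{(o_1)}|_{A^{(o_2)}}$, and $l^{(o_2)} = l^{(o_1)}|_{V^{(o_2)} \cup A^{(o_2)}}$. We denote this
by $o_2 \subseteq o_1$.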
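
A subobject test can be sketched as below, reading the definition as: the vertex and arc sets are subsets, and the source, target and label functions of the smaller object are restrictions of those of the larger one. The dictionary encoding with keys "V", "A", "s", "t", "l" and the sample objects are assumptions of this sketch.

```python
def is_subobject(o2, o1):
    """True if o2 is a subobject of o1.

    Each object is a dict with keys "V" and "A" (sets) and "s", "t", "l"
    (dicts). Both objects are assumed to be well-formed labeled graphs.
    """
    if not (o2["V"] <= o1["V"] and o2["A"] <= o1["A"]):
        return False
    # s, t and l of o2 must agree with those of o1 on the elements of o2.
    same_ends = all(
        o2["s"][a] == o1["s"][a] and o2["t"][a] == o1["t"][a]
        for a in o2["A"]
    )
    same_labels = all(o2["l"][x] == o1["l"][x] for x in o2["V"] | o2["A"])
    return same_ends and same_labels


# Hypothetical database o1 and a fragment o2 of it.
o1 = {
    "V": {"r", "p"}, "A": {"a"},
    "s": {"a": "r"}, "t": {"a": "p"},
    "l": {"r": "Root", "p": "#1", "a": "person"},
}
o2 = {"V": {"r"}, "A": set(), "s": {}, "t": {}, "l": {"r": "Root"}}
```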

For a given object $o$ we denote the set of all its subobjects by $\mathfrak{P}(o)$.

Lemma 1. For a given object $o$ the structure $[\mathfrak{P}(o), \subseteq]$ is a partially ordered set,
i.e., $\subseteq$ is a reflexive, antisymmetric and transitive binary relation over $\mathfrak{P}(o)$.

Lemma 2. For a given object $o$ the structure $[\mathfrak{P}(o), \subseteq]$ is a lattice, i.e., every
nonempty subset of $\mathfrak{P}(o)$ has a least upper and a greatest lower bound.



3 Schemata 

This section introduces the notions of schema and conformity between schema 
and object. Informally, a schema is an object that describes a set of objects. In 
the simpler syntactic framework of the label world this schema concept certainly 
exists as well. One label might describe a set of other labels. This is frequently
done - data types, predicates and regular expressions are examples. 

As a first step towards schemata in the graph world we assign schemata
from the label world to the elements of the graph. We choose predicates to be 
the label world schemata. 

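Definition 4 (Predicate schema). Given a set of unary predicates $\mathcal{P}$, a predicate
schema (over $\mathcal{P}$) is an object $s = (V^{(s)}, A^{(s)}, s^{(s)}, t^{(s)}, l^{(s)})$ where the elements
are labeled with predicates ($l^{(s)} : V^{(s)} \cup A^{(s)} \to \mathcal{P}$).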

We give an example in Figure 3. Note, that we treat a quoted constant c as 
an abbreviation for the predicate X = c. The predicate true() is a wildcard; it
holds for every label. 



Fig. 3. A simple predicate schema



To establish a relationship between a schema and the objects described by 
it we must establish the notion of conformity between schemata and objects.
Depending on the direction of the mapping we say that we match a schema into
an object or we interpret an object by a schema. 

Definition 5 (Naive conformity). A match of a predicate schema $s$ into an
object $o$ is an isomorphic embedding $m$ of $s$ into $o$, such that for all $x \in V^{(s)} \cup A^{(s)}$
the predicate $l^{(s)}(x)$ holds for $l^{(o)}(m(x))$.

If there exists a match of the schema $s$ into the object $o$ we say that $o$
conforms to $s$ and we call $o$ an instance (or also a match) of $s$.
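
A brute-force search makes the notion of a match concrete. The sketch below reads "isomorphic embedding" as an injective mapping of nodes and arcs that preserves sources and targets; the dictionary encoding, the helper predicates, and the sample database are assumptions, and the enumeration is exponential, so it is meant for small examples only.

```python
from itertools import permutations


def naive_matches(schema, obj):
    """Yield every match function (as a dict) of a predicate schema into obj.

    Both arguments are dicts with keys "V" and "A" (sets) and "s", "t", "l"
    (dicts). Schema labels are unary predicates, object labels plain values.
    """
    sv, sa = sorted(schema["V"]), sorted(schema["A"])
    for vs in permutations(sorted(obj["V"]), len(sv)):
        m = dict(zip(sv, vs))
        if not all(schema["l"][x](obj["l"][m[x]]) for x in sv):
            continue
        for arcs in permutations(sorted(obj["A"]), len(sa)):
            ma = dict(zip(sa, arcs))
            if all(
                obj["s"][ma[a]] == m[schema["s"][a]]
                and obj["t"][ma[a]] == m[schema["t"][a]]
                and schema["l"][a](obj["l"][ma[a]])
                for a in sa
            ):
                yield {**m, **ma}


def true(_label):
    """Wildcard predicate: holds for every label."""
    return True


def equals(c):
    """Predicate for a quoted constant c."""
    return lambda label: label == c


# Hypothetical database fragment and a schema "any node with an arc to 'Carpenter'".
obj = {
    "V": {"r", "p1", "n1", "n2"}, "A": {"a1", "a2", "a3"},
    "s": {"a1": "r", "a2": "p1", "a3": "p1"},
    "t": {"a1": "p1", "a2": "n1", "a3": "n2"},
    "l": {"r": "Root", "p1": "#1", "n1": "Carpenter", "n2": "Harry",
          "a1": "person", "a2": "surname", "a3": "name"},
}
schema = {
    "V": {"x", "y"}, "A": {"e"},
    "s": {"e": "x"}, "t": {"e": "y"},
    "l": {"x": true, "y": equals("Carpenter"), "e": true},
}
matches = list(naive_matches(schema, obj))
```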

Let $o$ be a database, $s$ be a schema and $o_1 \subseteq o$ a match of $s$. Then every
object $o_2$ with $o_1 \subseteq o_2 \subseteq o$ is also a match of $s$. Let $\mathfrak{M}^{(s)}(o)$ denote the set of
all matches of $s$ in $o$. Because $\mathfrak{M}^{(s)}(o) \subseteq \mathfrak{P}(o)$ (in set semantics), $[\mathfrak{M}^{(s)}(o), \subseteq]$ is
also a partially ordered set. We call a minimal element of this partially ordered
set a minimal match (or a minimal instance) of $s$ in $o$. We denote the set of
minimal matches of $s$ in $o$ with $\mathfrak{M}^{(s)}_{min}(o)$. In Figure 4 we show the same schema
as in Figure 3, but this time together with its minimal matches in the database
in Figure 2.

We enrich the semantics of our schemata by adding concepts typically found
in query languages. Variables can be used to link (or join) different parts of a
database based on the equality of labels. We add variables to our schemata in the
following manner: Let $s$ be a schema, $\mathcal{V}$ be a set of variables and
$v : V^{(s)} \cup A^{(s)} \to \mathcal{V}$ be a partial mapping from the nodes and arcs in the schema
into the variables.
[Figure: the predicate schema of Figure 3 next to its minimal matches (1), (2) and (3)]

Fig. 4. The predicate schema and its minimal matches



For a mapping $m$ to be a match of $s$ into an object $o$ we additionally require
for all $x_1, x_2 \in V^{(s)} \cup A^{(s)}$ that if $v(x_1)$ and $v(x_2)$ exist and $v(x_1) = v(x_2)$
then $l^{(o)}(m(x_1)) = l^{(o)}(m(x_2))$, i.e., the labels of the corresponding elements in
the match are the same. A predicate schema with variables together with its
minimal match is shown in Figure 5. Due to the very nature of semistructured
data, variables can be used to link data and structural parts of the database.

Fig. 5. Adding variables

For adding paths to our notion of schema we have to take into account
structural aspects of graphs. Let $G = (V, A, s, t)$ be a total directed graph. A
trail is an arc sequence $(a_{i_1}, \ldots, a_{i_n})$ where all $a_{i_j}$ are distinct and there exist
nodes $v_{i_0}, \ldots, v_{i_n}$ such that for all $a_{i_j}$, $s(a_{i_j}) = v_{i_{j-1}}$ and $t(a_{i_j}) = v_{i_j}$ hold. Note
that we do not require the $v_{i_0}, \ldots, v_{i_n}$ to be distinct. Thus, the notion of trail is
more general than that of a path, yet the number of trails in an arbitrary finite graph
is always finite, which makes it possible to handle cyclic structures. The number
of arcs in a trail is called the length of the trail. Despite the fact that we are
talking about trails we denote the set of all trails in a graph by $P$ and the set of
nonempty trails by $P^+$, because from the intuition point of view we are talking
about paths. For a nonempty trail $p_i = (a_{i_1}, \ldots, a_{i_n}) \in P^+$ we introduce source
and target functions $s_P, t_P : P^+ \to V$, defined in a canonical manner as
$s_P(p_i) = s(a_{i_1})$ and $t_P(p_i) = t(a_{i_n})$, respectively.
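
The finiteness of the set of trails is easy to see in code: a trail may revisit nodes, but every extension consumes an arc that has not been used before. The following enumeration is a sketch with hypothetical names; the two-node cycle at the end is a made-up example.

```python
def trails(arcs, source, target):
    """Return all nonempty trails as tuples of arc identifiers.

    A trail may pass through a vertex several times but uses each arc
    at most once, so the recursion always terminates on a finite graph.
    """
    found = []

    def extend(trail, used):
        found.append(tuple(trail))
        last_vertex = target[trail[-1]]
        for a in sorted(arcs):
            if a not in used and source[a] == last_vertex:
                extend(trail + [a], used | {a})

    for a in sorted(arcs):
        extend([a], {a})
    return found


# Hypothetical cyclic graph: u --a1--> v and v --a2--> u.
cycle_trails = trails({"a1", "a2"}, {"a1": "u", "a2": "v"}, {"a1": "v", "a2": "u"})
```

For the cycle the result contains four trails, two of length one and two of length two, even though the graph has infinitely many walks.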

Definition 6 (Corresponding trail graph). The corresponding trail graph
to a graph $G = (V, A, s, t)$ is defined as $G_P = (V, P^+, s_P, t_P)$.

Intuitively, in the corresponding trail graph the trails are materialized as 
arcs. This notion is related to the notion of transitive closure of a graph. The 
only difference between the two notions is that in the transitive closure only 
one arc is included for every pair of reachable nodes, whereas we include an arc 
for every trail via which they are reachable. Figure 6 shows three examples of
directed graphs and their corresponding trail graphs. 

[Figure: panels (a), (b) and (c)]

Fig. 6. Three directed graphs and their corresponding trail graphs



Lemma 3. A directed graph is always a subgraph of its corresponding trail 
graph. 

The lemma holds, because there is a natural embedding $a_i \mapsto (a_i)$ of the
arcs in $A$ into the trails in $P^+$.

Now we can extend our notion of schema. We introduce two additional functions
$q_{min}$ and $q_{max}$ that let us specify length constraints on paths in the matching
objects. Furthermore, in order to incorporate the previously mentioned variables,
we need a set of variables $\mathcal{V}$ and a variable mapping $v$.

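Definition 7 (Schema). Given a set of labels $\mathcal{L}$ and a set of variables $\mathcal{V}$, a
schema $s$ is a tuple $(V^{(s)}, A^{(s)}, s^{(s)}, t^{(s)}, l^{(s)}, v^{(s)}, q^{(s)}_{min}, q^{(s)}_{max})$ where

1. $V^{(s)}, A^{(s)}, s^{(s)}, t^{(s)}, l^{(s)}$ are defined as before,

2. $v^{(s)} : V^{(s)} \cup A^{(s)} \to \mathcal{V}$ is the variable mapping, a partial mapping from the
nodes and arcs in the schema into the variables, and

3. $q^{(s)}_{min} : A^{(s)} \to \mathbb{N}^+$ and $q^{(s)}_{max} : A^{(s)} \to \mathbb{N}^+ \cup \{+\infty\}$ are length restrictions.

Furthermore, if for an arbitrary arc $a_i \in A^{(s)}$ a variable binding $v^{(s)}(a_i)$ exists,
then $q^{(s)}_{min}(a_i) = q^{(s)}_{max}(a_i) = 1$ holds.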

Of course we have to redefine the notion of conformity between schema and 
object. 

Definition 8 (Conformity). A match of a schema $s$ into an object $o$ is an isomorphic
embedding $m$ of $s$ into $o_P$, i.e., an isomorphic embedding of $(V^{(s)}, A^{(s)}, s^{(s)}, t^{(s)})$
into $(V^{(o)}, P^{+(o)}, s_P^{(o)}, t_P^{(o)})$, so that the following properties hold:

1. For all nodes $x \in V^{(s)}$ the predicate $l^{(s)}(x)$ is true for $l^{(o)}(m(x))$.

2. For all arcs $x \in A^{(s)}$ the predicate $l^{(s)}(x)$ is true for the labels $l^{(o)}(y_j)$ of all
the arcs $y_j$ in the trail $m(x)$.

3. For all elements $x_1, x_2 \in V^{(s)} \cup A^{(s)}$ for which $v^{(s)}(x_1)$ and $v^{(s)}(x_2)$ exist
and $v^{(s)}(x_1) = v^{(s)}(x_2)$, the labels are the same: $l^{(o)}(m(x_1)) = l^{(o)}(m(x_2))$.

4. For all arcs $x \in A^{(s)}$ the length of the trail $m(x)$ is at least $q_{min}^{(s)}(x)$ and no
greater than $q_{max}^{(s)}(x)$.

If a match between a schema s and an object o exists we say that o conforms to 
s. 



If an object o conforms to a schema s we again call o an instance (or also a 
match) of s. To distinguish between functions that are matches and objects that 
are matches we also call the functions match functions. 

The following theorem states that we have indeed enhanced our initial notion of
schema, i.e., our new notion of schema does not contradict the initial one. Due
to space limitations we omit the proof of this theorem.

Theorem 1. An object $o$ conforms to a predicate schema $s$ in the naive manner
if and only if $o$ conforms to $s$ in the sense of Definition 8, assuming that $v^{(s)}$ is the
empty mapping and $q_{min}^{(s)}$ and $q_{max}^{(s)}$ equal one for all arcs in $s$.

Consider the example in Figure 7. There is a ‘+’-sign on the first arc in the
schema. It indicates that the length of the paths it matches is bounded by 1 and
$+\infty$. So the schema matches everything that emanates from the root and leads
to a ‘name’-arc.



Fig. 7. Adding paths



There are some subtleties here. A match of $s$ in $o$ is supposed to be a subobject
of $o$. However, the scope $m(s)$ of the match function $m$ is a subobject of $o_P$.
These subtleties become a serious problem when we want to adapt the definition
of minimal match. The notion of minimal match is particularly important for the
definition of queries, as will be seen in the next section. Consider Figure 8. (We
omitted the node labels there, because they are not relevant to this problem.)

Fig. 8. A problem with the minimal matches of paths

The schema on the left is matched to the database next to it. All three
matches are potentially “interesting”, but only the first one is minimal, because it
is a subobject of the other two. Besides, if one of the matches were more interesting
than the others, wouldn’t it be the one with the longest path, i.e., the one
on the right? But we observe that all three matches result from different
match functions. The scopes of their respective match functions are incomparable
subobjects of $o_P$. Thus, we define minimal matches with respect to the match
function. To achieve this we need a flatten function that takes a subobject of
$o_P$ and produces a subobject of $o$. Informally, flatten decomposes the trails
into arcs and adds all their source and target nodes to the node set. Then
we can define the set of minimal matches of $s$ in $o$, denoted by $\mathcal{M}^{min}_{(s)}(o)$, as
$\{flatten(m(s)) \mid m \text{ is a match of } s \text{ into } o\}$. We observe that every $flatten(m(s))$
is indeed a match of $s$ in $o$, because $s$ can be embedded into $flatten(m(s))_P$ using
$m$. Furthermore, for every match of $s$ in $o$ (i.e., every element of $\mathcal{M}_{(s)}(o)$) there
is a minimal match in $\mathcal{M}^{min}_{(s)}(o)$ that is a subobject of the former match. With
this revised definition all three matches on the right-hand side of Figure 8
are minimal.

4 Queries 

In this section we use the previously introduced schemata to define queries. All 
queries are based on matching schemata. Whereas the previous section dealt 
with the “What” -part of a query, this section deals with the “How” -part. 

A schema in itself already forms the most simple kind of query. It queries all 
subobjects of a database that conform to it. However, in such a case we would 
be interested only in the minimal matches. 

Definition 9 (Schema query). A schema query is a tuple $q = (s)$ where $s$ is
a schema. The answer to $q$ with respect to a database $o$ is the set of minimal
matches of $s$ in $o$, $\mathcal{M}^{min}_{(s)}(o)$.

As an example you can imagine any of the schemata from the previous sec- 
tion. With a schema we can formulate conditions that any match must fulfill. 
This roughly corresponds to a selection in the relational world. However, we 
would like to have a concept that is comparable to a projection. 

Definition 10 (Focus query). A focus query is a tuple $q = (s_1, s_2)$ where $s_1$ is
a schema and $s_2$, the focus, is a subobject of $s_1$. The answer to $q$ with respect to
a database $o$ is the union of the minimal matches of $s_2$ over all minimal matches
of $s_1$ in $o$, i.e., $\bigcup_{x \in \mathcal{M}^{min}_{(s_1)}(o)} \mathcal{M}^{min}_{(s_2)}(x)$.

The example in Figure 9 queries for the surnames of all persons with the
name ’Carpenter’. The subschema $s_2$ is indicated by the dashed box.







Fig. 9. A focus query



Sometimes we want to restructure the answer completely. Therefore we in- 
troduce the transformation query where we can specify a graph structure and 
compute new labels by using terms over the old ones. 

Definition 11 (Transformation query). A transformation query is a tuple
$q = (s, t)$ where $s$ is a schema and $t$ is an object labeled with terms over the
elements in $s$. The answer to $q$ is built by creating for every match of $s$ in $o$ a
new object isomorphic to $t$, labeled with the evaluated terms of $t$, instantiating
the terms by using the match.

The example in Figure 10 queries for the age of Suzy Smith. The age is 
computed from the year of birth. 



Fig. 10. A transformation query



Note that schema and focus queries can be expressed as transformation
queries. 

An obvious limitation of our approach is that we always get one answer per 
schema match. Thus, we currently do not support aggregation. However, we were 
able to express the operations of the relational algebra using our approach and 
the encoding for relational databases into graph models presented in [BDHS96]. 
For querying semistructured data our approach has the property that no knowledge
of a root node or specific paths going out from a root is necessary, because
the approach is based on graph matching.

5 Query processing using constraints 

In this section we outline our query processing technique. We focus on finding the 
matches for a given schema, because this part is the computationally challenging 
part. We start with the description of how to match schemata without any 
additional schema information given. A great benefit of our approach is that we 
can use previously matched schemata to speed up query processing. We outline 
this advantage at the end of the section. 

We base our query processing on constraint satisfaction techniques. There 
are at least two good reasons for this approach. First, the area of constraint 
satisfaction is well-studied with many techniques and heuristics available. Sec- 
ond, constraint satisfaction problems form a reasonably general class of search 
problems. Thus, we use a well-established framework for specifying our needs, 
for adapting our algorithms for richer kinds of schemata and, most important, 
for formulating our query processing based on previously matched schemata. 

Constraint satisfaction deals with solving problems by stating properties or
constraints that any solution must fulfill. A Constraint Satisfaction Problem
(CSP) is a tuple $(X, D, C)$ where

— $X$ is a set of variables $\{x_1, \ldots, x_m\}$,

— $D$ is a set of finite domains $D_i$, one for each variable $x_i \in X$, and

— $C$ is a set of constraints $\{C_{S_1}, \ldots, C_{S_n}\}$ restricting the values that the variables
can simultaneously take. The $S_i = (x_{S_{i_1}}, \ldots, x_{S_{i_k}})$ are arbitrary tuples
of variables from $X$ and each $C_{S_i}$ is a relation over the cross product of the
domains of these variables ($C_{S_i} \subseteq D_{S_{i_1}} \times \cdots \times D_{S_{i_k}}$).

Solving a CSP is finding assignments of values from the respective domains to 
the variables, so that all constraints are satisfied. In our context we are interested 
in finding all solutions of a CSP. 
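
As an illustration of the definition, the following Python sketch enumerates all solutions of a CSP by plain chronological backtracking. It is a minimal sketch, not the solver used in the paper; the function name `solve_all` and the encoding of a constraint as a pair (scope, relation) are assumptions made for the example.

```python
def solve_all(variables, domains, constraints):
    """Return every assignment that satisfies all constraints.

    constraints: list of (scope, relation) pairs, where scope is a tuple of
    variable names and relation is a set of allowed value tuples.
    """
    solutions = []

    def consistent(assignment):
        for scope, relation in constraints:
            # Only check constraints whose variables are all assigned.
            if all(v in assignment for v in scope):
                if tuple(assignment[v] for v in scope) not in relation:
                    return False
        return True

    def backtrack(i, assignment):
        if i == len(variables):
            solutions.append(dict(assignment))
            return
        var = variables[i]
        for value in domains[var]:
            assignment[var] = value
            if consistent(assignment):
                backtrack(i + 1, assignment)
            del assignment[var]

    backtrack(0, {})
    return solutions


# Toy instance: x < y over {1, 2, 3}, encoded as an explicit relation.
less = {(a, b) for a in (1, 2, 3) for b in (1, 2, 3) if a < b}
result = solve_all(["x", "y"], {"x": [1, 2, 3], "y": [1, 2, 3]},
                   [(("x", "y"), less)])
```

Representing each constraint extensionally, as a set of allowed tuples, mirrors the definition above, where every $C_{S_i}$ is a relation over a cross product of domains.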

The basic idea is as follows and is summarized in Figure 11. The database 
graph is transformed into suitable domains; and variables are introduced for 
the elements in the schema. Furthermore, constraints representing the match 
semantics are introduced. They can be categorized into the ones that represent 
the label part and the ones that represent the structural part of the match 
semantics. 

We depict the domains of the vertices and arcs from the database graph in 
Figure 2. 



$D_V = \{v_1, v_2, v_3, \ldots, v_{11}\}$
$D_A = \{a_1, a_2, a_3, \ldots, a_{12}\}$



The example schema in Figure 12 (the same as in Figure 3) gives us the variables 
and the domain assignments. 
Fig. 11. Schema matching as a Constraint Satisfaction Problem (1: domains, 2: variables, 3: constraints)



Fig. 12. A simple predicate schema



$X = \{x_1, x_2, x_3\}$

$D_1 = D_3 = D_V$

$D_2 = D_A$

Constraints are derived from the labels in the schema . . .

$C_{(x_1)} = \{(v_1), (v_2), (v_3), \ldots, (v_{11})\}$

$C_{(x_2)} = \{(a_1), (a_2), (a_3), \ldots, (a_{12})\}$

$C_{(x_3)} = \{(v_5), (v_7), (v_8)\}$

. . . and the structure of the schema.

$C_{(x_2,x_1)} = \{(a_1,v_1), (a_2,v_1), (a_3,v_1), (a_4,v_2), (a_5,v_4), (a_6,v_2),$
$(a_7,v_2), (a_8,v_2), (a_9,v_3), (a_{10},v_4), (a_{11},v_4), (a_{12},v_4)\}$

$C_{(x_2,x_3)} = \{(a_1,v_2), (a_2,v_3), (a_3,v_4), (a_4,v_3), (a_5,v_3), (a_6,v_5),$
$(a_7,v_6), (a_8,v_7), (a_9,v_8), (a_{10},v_9), (a_{11},v_{10}), (a_{12},v_{11})\}$

Our sample CSP has the solutions $(v_2, a_6, v_5)$, $(v_2, a_8, v_7)$, and $(v_3, a_9, v_8)$
for the variables $(x_1, x_2, x_3)$. They correspond to the matches of the schema
as shown in Figure 4. Note that if injectivity of the match is to be ensured,
additional constraints must be introduced.

More details about this part of the work (e.g., variables and paths) and about 
techniques for solving CSPs can be found in [BF99]. 
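
The sample CSP above is small enough to be checked by exhaustive enumeration. In the following Python sketch the integer i stands for $v_i$ or $a_i$, and the `source` and `target` dictionaries are transcribed from the two structural constraints listed above. The brute-force filter is only an illustration and stands in for the CSP techniques referenced in [BF99].

```python
# Source and target node of each arc a1..a12, read off the structural constraints.
source = {1: 1, 2: 1, 3: 1, 4: 2, 5: 4, 6: 2, 7: 2, 8: 2, 9: 3, 10: 4, 11: 4, 12: 4}
target = {1: 2, 2: 3, 3: 4, 4: 3, 5: 3, 6: 5, 7: 6, 8: 7, 9: 8, 10: 9, 11: 10, 12: 11}

D_V = range(1, 12)   # v1 .. v11
D_A = range(1, 13)   # a1 .. a12

c_x3 = {5, 7, 8}                             # label constraint on x3
c_x2_x1 = {(a, source[a]) for a in D_A}      # x1 must be the source of arc x2
c_x2_x3 = {(a, target[a]) for a in D_A}      # x3 must be the target of arc x2

# The label constraints on x1 and x2 admit every value, so they are omitted.
solutions = [
    (x1, x2, x3)
    for x1 in D_V for x2 in D_A for x3 in D_V
    if x3 in c_x3 and (x2, x1) in c_x2_x1 and (x2, x3) in c_x2_x3
]
```

The enumeration visits 11 · 12 · 11 = 1452 candidate triples, which illustrates why reducing the domains beforehand pays off on larger databases.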

We conclude this section by discussing how to find the matches of a schema 
using previously matched schemata. The basic underlying notion for this ap- 
proach is the notion of schema containment. It is related to the traditional notion 
of query containment. 
Definition 12 (Schema containment). A schema $s_1$ contains a schema $s_2$
if for all databases $o$ all matches of $s_2$ are also matches of $s_1$.

If $s_1$ contains $s_2$, then

1. matches of $s_2$ can only be found among the matches of $s_1$. If we want to find
the matches of $s_2$ and already have the ones for $s_1$ we can reduce the search
space.

2. all matches of $s_2$ are also matches of $s_1$. If we want to find the matches of
$s_1$ and already have the ones for $s_2$ we can present the first few matches
immediately. There may exist more matches for $s_1$, though.

Figure 13 shows three schemata. They contain one another left to right. 





Fig. 13. Schema containment



Let us assume a notion of containment for predicates: $p_1$ contains $p_2$ if for
all labels $x$ the implication $p_2(x) \Rightarrow p_1(x)$ holds. Now, informally, a schema $s_1$
contains another schema $s_2$ if $s_1$ is a subgraph of $s_2$, the predicates of $s_1$
contain the respective predicates of $s_2$, and the paths in $s_1$ are no longer than
the respective ones in $s_2$. The other direction of this implication does not hold.

We reduce the testing of these sufficient conditions for schema containment 
again to the Constraint Satisfaction Problem. Once we find a schema s' that 
contains our current schema s we can reduce the search space for the problem 
of finding the matches of s. 

In Figure 14 the schema $s$ that is to be matched (the schema on top) is
contained in the schema $s'$ on the left. The variables $x_1$, $x_2$, and $x_3$, which are
introduced when constructing the CSP for $s$, correspond to $y_1$, $y_2$, and $y_3$, respectively.
In the matches for $s'$, $y_1$ is matched to $v_2$, $v_3$, and $v_4$; $y_2$ is matched
to $a_6$, $a_9$, and $a_{10}$; and $y_3$ is matched to $v_5$, $v_8$, and $v_9$. Thus, we can construct
the reduced domains for $x_1$, $x_2$, and $x_3$.

$D_1 = \{v_2, v_3, v_4\}$

$D_2 = \{a_6, a_9, a_{10}\}$

$D_3 = \{v_5, v_8, v_9\}$

In order to fully capture the containment information we introduce an additional
constraint.

$C_{(x_1,x_2,x_3)} = \{(v_2,a_6,v_5), (v_3,a_9,v_8), (v_4,a_{10},v_9)\}$
Fig. 14. Reducing the search space
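
The reduction of the search space can be sketched in Python as follows. The sketch assumes that the matches of the containing schema $s'$ are available as triples $(y_1, y_2, y_3)$. The pairing of the values into triples is an assumption consistent with the source and target nodes of the arcs in the sample CSP, and the variable names are illustrative.

```python
# Matches of the containing schema s', as triples (y1, y2, y3).
matches_s_prime = [("v2", "a6", "v5"), ("v3", "a9", "v8"), ("v4", "a10", "v9")]

# Projecting onto each position yields the reduced domains of x1, x2, x3.
D1, D2, D3 = (sorted(set(column)) for column in zip(*matches_s_prime))

# The additional ternary constraint keeps only value combinations that
# actually occurred together in a match of s'.
c_x1_x2_x3 = set(matches_s_prime)
candidates = [
    (x1, x2, x3)
    for x1 in D1 for x2 in D2 for x3 in D3
    if (x1, x2, x3) in c_x1_x2_x3
]
```

The reduced domains alone still admit 3 · 3 · 3 = 27 combinations; the ternary constraint cuts these down to the three triples that were matches of $s'$. The constraints of $s$ itself then only need to be checked on these candidates.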



6 Related work 

Core issues and basic characterizations of semistructured data are discussed 
in [Abi97] and [Bun97]. Query languages include Lorel [MAG+97] and UnQL
[BDHS96]. XML-QL is similar to our approach in that it uses so-called element
patterns as the “What”-part of a query [DFF+99]. Work on schema information
for semistructured data concentrates on computing an ad hoc complete schema.
One example is DataGuides [GW97,GW99]. Another notion of schema is introduced
in [BDFS97]. Much related work arises also in the context of query
languages suited for the Web and Web-site management systems ([AMM97], 
[FFK+98], [KS95], [LSS96], [MMM96]). A fundamental work on data stored in 
files is [ACM93]. Structured files are transformed to databases such that file 
querying and manipulating by using database technology becomes possible. 

We were inspired in our query processing by the area of graph transforma- 
tions. Graph transformations address the dynamic aspects of graphs. Systems 
are typically rule-based and can be used to model behavior. Thus, these sys- 
tems must incorporate techniques for graph pattern matching. Rudolf uses con- 
straint satisfaction techniques for simple graph pattern matching [Rud98]. A 
more database-like approach to this problem can be found in [Zue93]. 

An introduction to the field of constraint satisfaction is provided in [Bar98] 
and [Kum92]. They give various algorithms, heuristics, and useful background 
information to efficiently solve CSPs. A theoretical study of backtracking algo- 
rithms can be found in [KvB97]. 

7 Conclusion 

In this paper we have presented a flexible approach to querying semistructured 
data. It is based on the intuition that a query consists of a “What” -part and 
a “How”-part. In contrast to more ad hoc query languages this split allows us
to reuse the “what” -part for optimization. We have proposed a general graph 
model as the underlying data representation. The matching of a (partial) schema 
(representing the “What”) forms the base for querying. We proposed a rather 
rich kind of schema covering predicates, variables and paths. Bringing the notion 
of schema closer to that of a query allows us to reuse query results more easily. 
The “How” -part of a query comes in by defining how to manipulate schema 
matches. We have outlined how to process a query based on posing constraints. 
In particular, it is possible to make use of previously matched schemata. 

We have started to implement our ideas into a Prolog-based system and 
are currently switching to the commercial constraint solver ECLiPSe [Ecl]. Additional
future work lies in assessing schemata in order to determine “good”
schemata for optimization. 
Abstract. Semistructured databases are treated as dynamically typed:
they come equipped with no independent schema or type system to con- 
strain the data. Query languages that are designed for semistructured 
data, even when used with structured data, typically ignore any type 
information that may be present. The consequences of this are what 
one would expect from using a dynamic type system with complex data: 
fewer guarantees on the correctness of applications. For example, a query 
that would cause a type error in a statically typed query language will 
return the empty set when applied to a semistructured representation of 
the same data. 

Much semistructured data originates in structured data. A semistruc- 
tured representation is useful when one wants to add data that does not 
conform to the original type or when one wants to combine sources of 
different types. However, the deviations from the prescribed types are 
often minor, and we believe that a better strategy than throwing away 
all type information is to preserve as much of it as possible. We describe 
a system of untagged union types that can accommodate variations in 
structure while still allowing a degree of static type checking. 

A novelty of this system is that it involves non-trivial equivalences among 
types, arising from a law of distributivity for records and unions: a value 
may be introduced with one type (e.g., a record containing a union) and 
used at another type (a union of records). We describe programming 
and query language constructs for dealing with such types, prove the 
soundness of the type system, and develop algorithms for subtyping and 
typechecking. 



1 Introduction 

Although semistructured data has, by definition, no schema, there are many 
cases in which the data obviously possesses some basic structure, perhaps with 
mild deviations from that structure. Moreover it typically has this structure be- 
cause it is derived from sources that have structure. In the process of annotating 
data or combining data from different sources one needs to accommodate the 
irregularities that are introduced by these processes. Because there is no way of 
describing “mildly irregular” structure, current approaches start by ignoring the 
structure completely, treating the data as some dynamically typed object such 
as a labelled graph and then, perhaps, attempting to recover some structure by 
a variety of pattern matching and data mining techniques [NAM97,Ali99]. The
purpose of this structure recovery is typically to provide optimization techniques
for query evaluation or efficient storage structures, and it is partial. It is
not intended as a technique for preserving the integrity of data or for any kind 
of static type-checking of applications. 

When data originates from some structured source, it is desirable to preserve 
that structure if at all possible. The typical cases in which one cannot require 
rigid conformance to a schema arise when one wants to annotate or modify 
the database with unanticipated structure or when one merges two databases 
with slight differences in structure. Rather than forgetting the original type 
and resorting to a completely dynamic type, we believe a more disciplined
approach to maintaining type information is appropriate. We propose here a 
type system that can “degrade” gracefully if sources are added with variations 
in structure, while preserving the common structure of the sources where it 
exists. 

The advantages of this approach include: 

— The ability to check the correctness of programs and queries 
on semistructured data. Current semistructured query languages 
[BDHS96,AQM+96,DFF+] have no way of providing type errors - 
they typically return the empty answer on data whose type does not 
conform to the type assumed by the query. 

— The ability to create data at one type and query it at another (equiva- 
lent) type. This is a natural consequence of using a flexible type system for 
semistructured data. 

— New query language constructs that permit the efficient implementation of
“case” expressions and increase the expressive power of OQL-style query
languages.

As an example, biological databases often have a structure that can be ex- 
pressed naturally using a combination of tuples, records, and collection types. 
They are typically cast in special-purpose data formats, and there are groups 
of related databases, each expressed in some format that is a mild variation on 
some original format. These formats have an intended type, which could be expressed
in a number of notations. For example, a source (source1) could have
type

set[ id: Int,
     description: Str,
     bibl: set[ title: Str, authors: list[name: Str, address: Str], year: Int ...],
     ...]

A second source (source2) might yield a closely related structure:

set[ id: Int,
     description: Str,
     bibl: set[ title: Str, authors: list[fn: Str, ln: Str, address: Str], year: Int ...],
     ...]

This differs only in the way in which author names are represented. (This
example is fictional, but not far removed from what happens in practice.)

The usual solution to this problem in conventional programming languages 
is to represent the union of the sources using some form of tagged union type: 

set(⟨ tag1 : [ id: Int, ... ], tag2 : [ id: Int, ... ] ⟩).

The difficulty with this solution is that a program such as

foreach x in source1 do print(x.description)    (1)

that worked on source1 must now be modified to

foreach x in source1 union source2 do
   case x of
     ⟨ tag1 = y1 ⟩ => print(y1.description)
   | ⟨ tag2 = y2 ⟩ => print(y2.description)

in order to work on the union of the sources, even though the two branches of the
case statement contain identical code! This is also true for the (few) database
query languages that deal with tagged union types [BLS+94].

Contrast this with a typical semi-structured query: 

select [ description = d, title = t ]    (2)

where [ description = d, bibl = [ title = t ] ] <- source1

This query works by pattern matching based on the (dynamically determined)
structure of the data. Thus the same query works equally well against
either of the two sources, and hence also against their union¹. The drawback of
this approach, however, is that incorrect queries - for example, queries that use 
a field that does not exist in either source - yield the empty set rather than an 
error. 
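
The drawback can be made concrete with a small Python sketch that mimics dynamic pattern matching over records encoded as dictionaries. The function `select_descriptions` and the sample data are hypothetical and serve only to show the behaviour.

```python
def select_descriptions(source, field):
    """Mimic a pattern-matching query: records lacking `field` do not match."""
    return [rec[field] for rec in source if field in rec]


# Hypothetical sample data.
source1 = [
    {"id": 1, "description": "kinase"},
    {"id": 2, "description": "ligase"},
]

ok = select_descriptions(source1, "description")
typo = select_descriptions(source1, "descripton")   # misspelled field name
```

The misspelled query is accepted silently and produces an empty answer, whereas a static type checker could reject it before evaluation.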

In this paper we define a system that combines the advantages of both ap- 
proaches, based on a system of type-safe untagged union types. As a first example, 
consider the two forms of the author field in the types above. We may write the 
union of these types as: 

[name: Str, address: Str] ∨ [ln: Str, fn: Str, address: Str]

It is intuitively obvious that an address can always be extracted from a value of
such a type. To express this formally, we begin by writing a multi-field record
type [l1 : T1, l2 : T2, ...] as a product of single-field record types: [l1 : T1] × [l2 : T2] × ....
In this more basic form, the union type above is:

([name: Str] × [address: Str]) ∨ ([ln: Str] × [fn: Str] × [address: Str])

¹ One could also achieve the same effect through the use of inheritance rather than
union types in some object-oriented language. This would involve the introduction of
named classes with explicit subclass assertions. As we shall shortly see, the number
of possible classes is exponential in the size of the type.
We now invoke a distributivity law that allows us to treat

[a : Ta] × ([b : Tb] ∨ [c : Tc]) and ([a : Ta] × [b : Tb]) ∨ ([a : Ta] × [c : Tc])

as equivalent types. Using this, the union type above rewrites to:

([name: Str] ∨ ([fn: Str] × [ln: Str])) × [address: Str]

In this form, it is evident that the selection of the address field is an allowable
operation.

Type-equivalences like this distributivity rule allow us to introduce a value
at one type and operate on it at another type. Under this system both the program
(1) and the query (2) above will type-check when extended to the union of the 
two sources. On the other hand, queries that reference a field that is not in either 
source will fail to type check. 
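
The effect of the distributivity law on field selection can be sketched in Python. This is a simplified illustration and not the subtyping algorithm of the paper: record types are encoded as dictionaries from field names to type names, field names in different factors are assumed to be distinct, and the helper names `distribute` and `can_select` are hypothetical.

```python
from itertools import product


def distribute(factors):
    """Rewrite a product of unions of records into a union of records.

    factors: list of unions; each union is a list of records,
    each record a dict mapping field name -> type name.
    """
    alternatives = []
    for choice in product(*factors):
        record = {}
        for part in choice:
            record.update(part)
        alternatives.append(record)
    return alternatives


def can_select(union_of_records, field):
    """Field selection is allowed only if every alternative has the field."""
    return all(field in record for record in union_of_records)


# ([name: Str] ∨ ([fn: Str] × [ln: Str])) × [address: Str]
author = [
    [{"name": "Str"}, {"fn": "Str", "ln": "Str"}],
    [{"address": "Str"}],
]
flat = distribute(author)
```

Selecting `address` is accepted because it occurs in both alternatives, whereas `name` is rejected because only one alternative provides it.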

Some care is needed in designing the operations for manipulating values of 
union types. Usually, the interrogation operation for records is field selection and 
the corresponding operation for unions is a case expression. However it is not 
enough simply to use these two operations. Consider the type ([a1 : T1] ∨ [b1 : U1])
× ... × ([an : Tn] ∨ [bn : Un]). The form of this type warrants neither selecting a
field nor using a case expression. We can, if we want, use distributivity to rewrite 
it into a disjunct of products, but the size of this disjunct is exponential in n 
and so, presumably, would be the corresponding case expression. We propose, 
instead, an extended pattern matching syntax that allows us to operate on the 
type in its original, compact, form. 
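
The exponential growth is easy to observe. The following Python sketch enumerates the disjuncts obtained by distributing a product of n two-way unions; field names stand in for the single-field record types, and the function name `disjuncts` is illustrative.

```python
from itertools import product


def disjuncts(n):
    """Records obtained by distributing ([a_i] ∨ [b_i]) for i = 1..n."""
    factors = [(f"a{i}", f"b{i}") for i in range(1, n + 1)]
    return [frozenset(choice) for choice in product(*factors)]


sizes = [len(disjuncts(n)) for n in range(1, 6)]
```

The type in its compact form has 2n single-field components, while its disjunctive form has 2^n alternatives.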

More sophisticated pattern matching operations may be useful additions even 
to existing semistructured query languages. Consider the problem of writing a 
query that produces a uniform output from a single source that contains two 
representations of names: 

( select [ description = d, name = n ]
  where [ description = d, bibl = [ author = [ name = n ] ] ] <- source )

union

( select [ description = d, name = string-concat(f, l) ]
  where [ description = d, bibl = [ author = [ ln = l, fn = f ] ] ] <- source )

This is the only method known to the authors of expressing this query in current 
semistructured query languages. It suggests an inefficient execution model and 
may not have the intended semantics when, for example, the source is a list 
and one wants to preserve the order. Thus some enhancement to the syntax is 
desirable. 

This paper develops a type system based on untagged union types along with
operations to construct and deconstruct these types. In particular, we define a
syntax of patterns that may be used both for an extended form of case expression
and as an extension to existing query languages for semi-structured data.
We should remark that we cannot capture all aspects of semistructured query
languages. For example, we have nothing that corresponds to “regular path expressions”
[BDHS96,AQM+96]. However, we believe that for most examples of
“mildly” semistructured data - especially the forms that arise from the integration
of typed data sources - a language such as proposed here will be adequate.
Our main technical contribution is a proof of the decidability of subtyping for
this type system (which is complicated by the non-trivial equivalences involving
union and record types).

To our knowledge, untagged union types have never been formalized in the context
of database programming languages. Tagged union types have been suggested
in several papers on data models [AH87,CM94] but have had minimal
impact on the design of query languages. CPL [BLS+94], for example, can 
match on only one tag of a tagged union, and this is one of the few lan- 
guages that makes use of union types. Pattern matching has been recently ex- 
ploited in languages for semi-structured data and XML [BDHS96,DFF+]. In 
the programming languages and type theory communities, on the other hand, 
untagged union types have been studied extensively from a theoretical perspec- 
tive [Pie91,BDCd95,Hay91,Dani94,DCdP96, etc.], but the interactions of unions 
with higher-order function types have been shown to lead to significant com- 
plexities; the present system provides only a very limited form of function types 
(like most database query languages), and remains reasonably straightforward. 

Section 2 develops our language for programming with record and union 
types, including pattern matching primitives that can be used in both case 
expressions and query languages. Section 3 describes the system formally and 
demonstrates the decidability of subtyping and type equivalence. Proofs will be 
provided in the full paper. Section 4 offers concluding remarks. 

2 Programming with Union Types 

In this section we shall develop a syntax for the new programming constructs 
that are needed to deal with union types. The presentation is informal for the 
moment - more precise definitions appear in Section 3. We start with operations 
on records and extend these to work with unions of records; we then deal with 
operations on sets. Taken in conjunction with operations on records, these oper- 
ations are enough to define a simple query language. We also look at operations 
on more general union types and give examples of a “typecase” operation. 

2.1 Record Formation 

We have formulated record types [ l1 : T1, ..., ln : Tn ] as products [ l1 : T1 ] × ... ×
[ ln : Tn ] of elementary or “singleton” record types. For record values, we use
the standard presentation in terms of multi-field values [ l1 = e1, ..., ln = en ].

2.2 Case Expressions 

Records are decomposed through the use of case expressions. These allow us 
to take alternative actions based on the structure of values. We shall also be 



Union Types for Semistructured Data 






able to use components of the syntax of case expressions in the development 
of matching constructs for query languages. The idea in developing a relatively 
complex syntax for the body of case expressions is that the structure of the body 
can be made to match the expected structure of the type of the value on which 
it is operating. There should be no need to “flatten” the type into disjunctive 
normal form and write a much larger case expression at that type. 

We start with a simple example: 

case e of [ fn = f:Str, ln = l:Str ] => string-concat(f, l)
        | [ name = n:Str ] => n

This matches the result of evaluating e to one of two record types. If the
result is a record with fn and ln fields, the variables f and l are bound and
the right-hand side of the first clause is evaluated. If the first pattern does not
match, the second clause is tried. This case expression will work provided e has
type [fn: Str, ln: Str] ∨ [name: Str].
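
For readers who prefer an executable analogue, the case expression can be emulated in Python with records encoded as dictionaries. This is a sketch only: the checks below happen at run time, whereas in the system described here the admissibility of the case expression is established statically. The function name `full_name` is hypothetical.

```python
def full_name(e):
    """Emulate: case e of [fn = f, ln = l] => string-concat(f, l) | [name = n] => n."""
    if set(e) == {"fn", "ln"}:       # first pattern: a record with fn and ln
        f, l = e["fn"], e["ln"]
        return f + l
    if set(e) == {"name"}:           # second pattern: a record with name only
        return e["name"]
    raise TypeError("value has neither shape [fn, ln] nor shape [name]")
```

For example, `full_name({"fn": "Ada", "ln": "Lovelace"})` takes the first clause and `full_name({"name": "Ada Lovelace"})` the second.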

We should note that pattern matching introduces identifiers such as l, f, n
in this example, and we shall make a short-sighted assumption that identifiers
are introduced when they are associated with a type (x : T). This ignores the
possibility of type inference. See [BLS+94] for a more sophisticated syntax for
introducing identifiers in patterns.

Field selection is given by a one-clause case expression: case e of [ l = x:T ] => x.

We shall also allow case expressions to dispatch on the run-time type of an 
argument: 

case e of x:Int => x
        | y:set(Int) => sum(y)

This will typecheck when e : Int ∨ set(Int).
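
A run-time analogue of this dispatch can be written in Python with `isinstance` tests. The sketch only mirrors the evaluation behaviour; the static guarantee that e has type Int ∨ set(Int) has no counterpart here, and the function name `typecase` is illustrative.

```python
def typecase(e):
    """Emulate: case e of x:Int => x | y:set(Int) => sum(y)."""
    if isinstance(e, int):
        return e
    if isinstance(e, (set, frozenset)) and all(isinstance(v, int) for v in e):
        return sum(e)
    raise TypeError("expected a value of type Int or set(Int)")
```

The final `raise` plays the role of the type error that the static system would report at compile time.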

The clauses of a case expression have the form p => e, where p is a pattern 
that introduces (binds) identifiers which may occur free in the expression e. 
Thus each clause defines a function. Two or more functions can be combined by 
writing p1 => e1 | p2 => e2 | ... to form another function. The effect of the case
expression case e of f is to apply this function to the result of evaluating e.

Now suppose we want to extract information from a value of type 

([name: Str] ∨ [ln: Str, fn: Str]) × [age: Int]    (2)

The age field may be extracted by field selection, using a one-clause case
expression as described above. However, information from the left-hand component
cannot be extracted by extending this case expression. What we need is
something that will turn a multi-clause function back into a pattern that
binds a new identifier. We propose the syntax x as f, in which f is a multi-clause
function. In the evaluation of x as f, f is applied to the appropriate structure
and x is bound to the result.
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case e of 

X as {[fn = f'.Str, In = l:Str ] string-concat{f, 1) \ [ name = n:Str ] n) 
4t^ [ age = a:Int ] 

^ [ name = x, age = a + 1] 

could be applied to an expression e of type (2) above. Note the use of # to
combine two patterns so that they match on a product type. This symbol is 
used to concatenate patterns in the same way that it is used to concatenate 
record types. 
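
The behaviour of x as f and # can be sketched in Python. This is only an illustration under stated assumptions, not the paper's implementation: records are modelled as dicts, a pattern is a function returning a bindings dict (or None on failure), and the helper names field, concat, as_pattern and apply_fn are hypothetical.

```python
class NoMatch(Exception):
    """Raised when no clause of a multi-clause function matches."""


def field(label, var, typ):
    """Pattern [label = var:typ]: binds var if the record has that field."""
    def pat(value):
        if isinstance(value, dict) and isinstance(value.get(label), typ):
            return {var: value[label]}
        return None
    return pat


def concat(*pats):
    """Pattern p1 # p2 # ...: every part must match the same value."""
    def pat(value):
        env = {}
        for p in pats:
            bindings = p(value)
            if bindings is None:
                return None
            env.update(bindings)
        return env
    return pat


def apply_fn(clauses, value):
    """Apply a multi-clause function p1 => e1 | p2 => e2 | ... to a value."""
    for pat, body in clauses:
        env = pat(value)
        if env is not None:
            return body(env)
    raise NoMatch(value)


def as_pattern(var, clauses):
    """Pattern x as f: apply f to the value and bind x to the result."""
    def pat(value):
        try:
            return {var: apply_fn(clauses, value)}
        except NoMatch:
            return None
    return pat


# ([fn = f:Str, ln = l:Str] => string-concat(f, l) | [name = n:Str] => n)
full_name = [
    (concat(field("fn", "f", str), field("ln", "l", str)),
     lambda env: env["f"] + env["l"]),
    (field("name", "n", str), lambda env: env["n"]),
]

# x as full_name # [age = a:Int] => [name = x, age = a + 1]
birthday = [
    (concat(as_pattern("x", full_name), field("age", "a", int)),
     lambda env: {"name": env["x"], "age": env["a"] + 1}),
]

r1 = apply_fn(birthday, {"fn": "Ada", "ln": "Lovelace", "age": 36})
r2 = apply_fn(birthday, {"name": "Turing", "age": 41})
```

Both shapes of the left-hand component of type (2) are handled by the single clause, which is the point of nesting a function inside the pattern.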

There are some useful extensions to case expressions and pattern matching 
that we shall briefly mention here but omit in the formal development (they 
are essentially syntactic sugar). The first is the addition of a “fall-through” or 
else branch of a case expression. The pattern else matches any value that has 
not been matched in a previous clause. Most programming languages have an 
analogous construct. 

Such branches are particularly useful if we allow constants in patterns. For 
example 

case e of [name = n:Str, age = 21] => e
        | else => ...

Here only tuples with a specific value for age are matched. Tuples with a different 
value will be matched in the else clause. Note that patterns bind variables, and 
that if one allows constants in patterns, one wants to discriminate between those 
variables that are used as constants and those that are bound in the pattern. 
CPL [BLS+94] uses a special marker to flag bound variables. In that language
[name = n, age = \a] is a pattern in which a is bound and n is treated as a 
constant - it is bound in some outer scope. This extended syntax of patterns is 
especially convenient when used in query languages for sets. 

2.3 Sets 

We shall follow the approach to collection types given in [BNTW95]. It is known
that both relational and complex-object languages can be expressed using this
formalism. The operations for forming sets are {e} (singleton set) and e union e
(set union).¹ For “iterating” over a set we use the form

collect e where p <- e′.

Here, e and e′ are both expressions of set type, and p is a pattern as described
above. The meaning of this is (informally) ⋃{σ(e) | σ(p) ∈ e′}, in which σ is a
substitution that binds the variables of p to match an element of e′.

These operations, taken in conjunction with the record operations described 
above and an equality operation, may be used as the basis of practical query lan- 
guages. Conditionals and booleans may be added, but they can also be simulated 
with case expressions and some appropriately chosen constants. 
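
The informal reading of collect as a union over matching elements can be sketched as follows. This is a minimal illustration, assuming Python sets for set values and the same hypothetical convention as before: a pattern is a function that returns a bindings dict or None. Under the informal reading, elements that do not match are skipped; the formal rules of Section 3 instead rely on typing to guarantee that every element matches.

```python
def collect(body, pattern, source):
    """collect body where pattern <- source.

    pattern(v) returns a dict of bindings, or None if v does not match.
    body(env) returns a set; the result is the union of all such sets.
    """
    result = set()
    for v in source:
        env = pattern(v)
        if env is not None:
            result |= body(env)
    return frozenset(result)


def int_pat(v):
    """Pattern x:Int (a typecase on the run-time type of the element)."""
    return {"x": v} if isinstance(v, int) else None


# collect {x * x} where x:Int <- {1, 2, 3}
squares = collect(lambda env: {env["x"] * env["x"]}, int_pat, {1, 2, 3})

# A filter is a collect whose body returns the empty set when a condition fails.
big = collect(lambda env: {env["x"]} if env["x"] > 1 else set(), int_pat, {1, 2, 3})
```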

¹ The present system does not include {} (empty set). It can be added, at the cost of
a slight extension to the type system; see Section 4.

Unlike typed systems with tagged unions, in our system there is no formation
operation directly associated with the union type. However, we may want to
introduce operators such as “relaxed set-union,” which takes two sets of type
set(t1) and set(t2) and returns a set of type set(t1 ∨ t2).

2.4 Examples 

We conclude this section with some remarks on high-level query languages. A 
typical form of a query that makes use of pattern matching is: 

select e
where p1 <- e1,
      p2 <- e2,
      ...
      condition

Here the pi are patterns and the expressions e1, ..., en have set types. Variables
introduced in pattern pi may be used in expression ej and (as constants) in
pattern pj where j > i. They may also be used in the expression e and the condition,
which is simply a boolean expression. This query form can be implemented using
the operations described in the previous section.

As an example, here is a query based on the example types in the introduc- 
tion. We make use of the syntax of patterns as developed for case expressions, 
but here we are using them to match on elements of one or more input sets. 

select [description = d, authName = a, year = y]
where [description = d:Str, bibl = b:BT] <- source1 union source2,
      [authors = aa:AT, year = y:Int] <- b,
      a as ([fn = f:Str, ln = l:Str] => string-concat(f, l) |
            [name = n:Str] => n) <- aa,
      y > 1991

Note that we have assumed a “relaxed” union to combine the two sources. In 
the interests of consistency with the formal development, we have also inserted 
all types for identifiers, so AT and BT are names for the appropriate fragments 
of the expected source type. In many cases such types can be inferred. 

Here are two examples that show the use of patterns in matching on types
rather than record structures. Examples of this kind are commonly used to
illustrate the need for semistructured data.

select x
where x as (s : set(Num) => average(s) | r : Num => r) <- source

select s
where s as (n : Str => n | [fn = f:Str, ln = l:Str] => string-concat(f, l)) <- source

In the first case we have a set source that may contain both numbers and sets
of numbers. In the second case we have a set that may contain both base types
and record types. Both of these can be statically type-checked. If, for example,
in the first query, s has type set(Str), the query would not type-check.

To demonstrate the proposed syntax for the use of functions in patterns, 
here is one last (slightly contrived) example. We want to calculate the mass of 
a solid object that is either rectangular or a sphere. Each measure of length can 
be either integer or real. The type is 

[density: Real]
×
( [intRadius: Int] ∨ [realRadius: Real]
  ∨
  ( ([intHeight: Int] ∨ [realHeight: Real])
    ×
    ([intWidth: Int] ∨ [realWidth: Real])
    ×
    ([intDepth: Int] ∨ [realDepth: Real]) ) )



The following case expression makes use of matching based on both unions 
and products of record structures. Note that the structure of the expression fol- 
lows that of the type. It would be possible to write an equivalent case expression 
for the disjunctive normal form for the type and avoid the use of the form x as/, 
but such an expression would be much larger than the one given here. 

case e of
  [density = d:Real]
  #
  v as
    ( r as ([intRadius = ir:Int] => float(ir) | [realRadius = rr:Real] => rr)
      => r**3 )
    |
    ( h as ([intHeight = ih:Int] => float(ih) | [realHeight = rh:Real] => rh)
      #
      w as ([intWidth = iw:Int] => float(iw) | [realWidth = rw:Real] => rw)
      #
      d as ([intDepth = id:Int] => float(id) | [realDepth = rd:Real] => rd)
      => h * w * d )
  => d*v

3 Formal Development

With the foregoing intuitions and examples in mind, we now proceed to the
formal definition of our language, its type system, and its operational semantics.
Along the way, we establish fundamental properties such as run-time safety and
the decidability of subtyping and type-checking.

3.1 Types

We develop a type system that is based on conventional complex object types,
those that are constructed from the base types with record (tuple) and set
constructors. As described in the introduction, the record constructors are [ ],
the empty record type, [l : T], the singleton record type, and R × R, the
disjoint concatenation of two record types. (By disjoint we mean that the two
record types have no field names in common.) Thus a conventional record type
[l1 : T1, ..., ln : Tn] is shorthand for [l1 : T1] × ... × [ln : Tn]. To this we add an
untagged union type T ∨ T. We also assume a single base type B and a set type
set(T). Other collection types such as lists and multisets would behave similarly.
The syntax of types is described by the following grammar:



T ::= B            base type
      [ ]          empty record type
      [l : T]      labeling (single-field record type)
      T1 × T2      record type concatenation
      T1 ∨ T2      union type
      set(T)       set type



3.2 Kinding 

We have already noted that certain operations on types are restricted. For ex- 
ample, we cannot form the product of two record types with a common field 
name. In order to control the formation of types we introduce a system of kinds, 
consisting of the kind of all types, Type, and a subkind Rcd(L), which is the kind
of all record types whose labels are included in the label set L.

K ::= Type       kind of all types
      Rcd(L)     kind of record types with (at most) labels L

The kinding relation is defined as follows: 



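\[
\begin{array}{cl}
B \in \mathit{Type} & \text{(K-Base)} \\[1.5ex]
[\,] \in \mathit{Rcd}(\{\}) & \text{(K-Empty)} \\[1.5ex]
\dfrac{T \in \mathit{Type}}{[l : T] \in \mathit{Rcd}(\{l\})} & \text{(K-Field)} \\[2.5ex]
\dfrac{S \in \mathit{Rcd}(L_1) \qquad T \in \mathit{Rcd}(L_2) \qquad L_1 \cap L_2 = \emptyset}{S \times T \in \mathit{Rcd}(L_1 \cup L_2)} & \text{(K-Rcd)} \\[2.5ex]
\dfrac{S \in K \qquad T \in K}{S \vee T \in K} & \text{(K-Union)} \\[2.5ex]
\dfrac{T \in \mathit{Type}}{\mathit{set}(T) \in \mathit{Type}} & \text{(K-Set)} \\[2.5ex]
\dfrac{T \in \mathit{Rcd}(L_1)}{T \in \mathit{Rcd}(L_1 \cup L_2)} & \text{(K-Subsumption-1)} \\[2.5ex]
\dfrac{T \in \mathit{Rcd}(L)}{T \in \mathit{Type}} & \text{(K-Subsumption-2)}
\end{array}
\]

194 Peter Buneman and Benjamin Pierce

There are two important consequences of these rules. First, record kinds
extend to the union type. For example, ([A : t] × [B : t]) × ([C : t] ∨ [D : t])
has kind Rcd({A, B, C, D}). Second, the kinding rules require the labels in
a concatenation of two record types to be disjoint. (However, the union type
constructor is not limited in the same way; Int ∨ Str and Int ∨ [a : Str] are
well-kinded types.)

In what follows, we will assume that all types under consideration are well
kinded.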
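
The kinding rules can be read as a checker that computes the smallest label set of a record type. The following is a sketch under assumptions of our own: types are encoded as nested tuples, the names kind, KindError and TYPE are hypothetical, and the two subsumption rules are folded into the union and product cases rather than applied as separate steps.

```python
TYPE = "Type"  # the kind of all types; record kinds are frozensets of labels


class KindError(Exception):
    """Raised for ill-kinded types."""


def kind(t):
    """Return TYPE, or the least label set L such that t has kind Rcd(L)."""
    tag = t[0]
    if tag == "base":                       # K-Base
        return TYPE
    if tag == "empty":                      # K-Empty
        return frozenset()
    if tag == "field":                      # K-Field
        _, label, inner = t
        kind(inner)                         # premise: the field type is well kinded
        return frozenset({label})
    if tag == "prod":                       # K-Rcd
        l1, l2 = kind(t[1]), kind(t[2])
        if l1 == TYPE or l2 == TYPE:
            raise KindError("product of non-record types")
        if l1 & l2:
            raise KindError(f"common labels: {sorted(l1 & l2)}")
        return l1 | l2
    if tag == "union":                      # K-Union, after subsumption
        k1, k2 = kind(t[1]), kind(t[2])
        if k1 == TYPE or k2 == TYPE:
            return TYPE
        return k1 | k2
    if tag == "set":                        # K-Set
        kind(t[1])
        return TYPE
    raise ValueError(f"unknown type constructor: {tag}")


t = ("base",)
A, B, C, D = (("field", name, t) for name in "ABCD")

# ([A : t] x [B : t]) x ([C : t] v [D : t])
example = ("prod", ("prod", A, B), ("union", C, D))
example_kind = kind(example)
mixed_kind = kind(("union", t, ("field", "a", t)))
```

The product of two record types that share a label, such as ("prod", A, A), is rejected, while a union of a base type and a record type is accepted at kind Type.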



3.3 Subtyping 

As usual, the subtype relation, written S <: T, captures a principle of “safe
substitutability”: any element of S may safely be used in a context expecting an
element of T.

For sets and records, the subtyping rules are the standard ones: set(S) <:
set(T) if S <: T (e.g., a set of employees can be used as a set of people), and a
record type S is a subtype of a record type T if S has more fields than T and the
types of the common fields in S are subtypes of the corresponding fields in T.
This effect is actually achieved by the combination of several rules below. This
“exploded presentation” of record subtyping corresponds to our presentation
of record types in terms of separate empty-record, singleton, and concatenation
constructors.

For union types, the subtyping rules are a little more interesting. First, we 
axiomatize the fact that S ∨ T is the least upper bound of S and T: that is,
S ∨ T is above both S and T, and everything that is above both S and T is also
above their union (rules S-Union-UB and S-Union-L below). We then have 
two rules (S-Dist-Rcd and S-Dist-Field) showing how union distributes over 
records. 

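Formally, the subtype relation is the least relation on well-kinded types closed
under the following rules.

\[
\begin{array}{cl}
T <: T & \text{(S-Refl)} \\[1.5ex]
\dfrac{R <: S \qquad S <: T}{R <: T} & \text{(S-Trans)} \\[2.5ex]
[l : T] <: [\,] & \text{(S-Rcd-FE)} \\[1.5ex]
S \times T <: S & \text{(S-Rcd-RE)} \\[1.5ex]
S \times T <: T \times S & \text{(S-Rcd-Comm)} \\[1.5ex]
S \times (T \times U) <: (S \times T) \times U & \text{(S-Rcd-Assoc)} \\[1.5ex]
S <: S \times [\,] & \text{(S-Rcd-Ident)} \\[1.5ex]
\dfrac{S <: T}{[l : S] <: [l : T]} & \text{(S-Rcd-DF)} \\[2.5ex]
\dfrac{S_1 <: T_1 \qquad S_2 <: T_2}{S_1 \times S_2 <: T_1 \times T_2} & \text{(S-Rcd-DR)} \\[2.5ex]
\dfrac{S <: T}{\mathit{set}(S) <: \mathit{set}(T)} & \text{(S-Set)} \\[2.5ex]
\dfrac{R <: T \qquad S <: T}{R \vee S <: T} & \text{(S-Union-L)} \\[2.5ex]
S_i <: S_1 \vee S_2 & \text{(S-Union-UB)} \\[1.5ex]
R \times (S \vee T) <: (R \times S) \vee (R \times T) & \text{(S-Dist-Rcd)} \\[1.5ex]
[l : S \vee T] <: [l : S] \vee [l : T] & \text{(S-Dist-Field)}
\end{array}
\]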

Note that we restrict the subtype relation to well-kinded types: S is never a 
subtype of T if either S or T is ill-kinded. (The typing rules will be careful 
only to “call” the subtype relation on types that are already known to be well 
kinded.) 

If both S <: T and T <: S, we say that S and T are equivalent and write
S ~ T. Note, for example, that the distributive laws S-Dist-Rcd and S-Dist-Field
are actually equivalences: the other directions follow from the laws for
union (plus transitivity). Also, note the absence of the “other” distributivity
law for unions and records: P ∨ (Q × R) ~ (P ∨ Q) × (P ∨ R). This law doesn’t
make sense here, because it violates the kinding constraint that products of
record types can only be formed if the two types have disjoint label sets.

The subtype relation includes explicit rules for associativity and commutativity
of the operator ×. Also, it is easy to check that the associativity, commutativity
and idempotence of ∨ follow directly from the rules given. We shall take
advantage of this fluidity in the following by writing both records and unions in
a compound, n-ary form:

\[
[l_1 : T_1, \ldots, l_n : T_n] \;\stackrel{\mathrm{def}}{=}\; [l_1 : T_1] \times \cdots \times [l_n : T_n]
\]
\[
\bigvee(T_1, \ldots, T_n) \;\stackrel{\mathrm{def}}{=}\; T_1 \vee \cdots \vee T_n
\]

(In the first line, n may be 0, since we allow empty records, but in the second it
must be positive: for simplicity, we do not allow “empty unions” in the present
system. See Section 4.)

We often write compound unions using a simple comprehension notation. For 
example, 

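\[
\bigvee(A \times B \mid A \in A_1 \vee A_2 \vee \ldots \vee A_m \text{ and } B \in B_1 \vee B_2 \vee \ldots \vee B_n)
\]
denotes
\[
A_1 \times B_1 \vee A_1 \times B_2 \vee \ldots \vee A_1 \times B_n \vee A_2 \times B_1 \vee \ldots \vee A_m \times B_n.
\]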

3.4 Properties of Subtyping 

For proving properties of the subtype relation, it is convenient to work with 
types in a more constrained syntactic form: 

3.4.1 Definition: The sets of normal (N) and simple (A) types are defined as
follows:

\[
\begin{array}{rcl}
N & ::= & \bigvee(A_1, \ldots, A_n) \\[1ex]
A & ::= & B \\
  & \mid & [l_1 : A_1, \ldots, l_n : A_n] \\
  & \mid & \mathit{set}(N)
\end{array}
\]

Intuitively, a simple type is one in which unions only appear (immediately) inside
of the set constructor; a normal type is a union of simple types. Note that every
simple type is also normal. □

The restricted form of normal and simple types can be exploited to give a
much simpler subtyping relation, written S ◁ T, in terms of the following “macro
rules”:

\[
\begin{array}{cl}
B \lhd B & \text{(SA-Base)} \\[1.5ex]
\dfrac{N \lhd M}{\mathit{set}(N) \lhd \mathit{set}(M)} & \text{(SA-Set)} \\[2.5ex]
\dfrac{\{k_1, \ldots, k_m\} \subseteq \{l_1, \ldots, l_n\} \qquad \text{for all } k_i \in \{k_1, \ldots, k_m\},\; A_{k_i} \lhd B_{k_i}}
      {[l_1 : A_{l_1}, \ldots, l_n : A_{l_n}] \lhd [k_1 : B_{k_1}, \ldots, k_m : B_{k_m}]} & \text{(SA-Rcd)} \\[2.5ex]
\dfrac{\forall i \leq m.\; \exists j \leq n.\; A_i \lhd B_j}
      {\bigvee(A_1, \ldots, A_m) \lhd \bigvee(B_1, \ldots, B_n)} & \text{(SN-Union)}
\end{array}
\]

3.4.2 Fact: N ◁ M is decidable. □

Proof: The macro rules can be read as a pair of algorithms, one for subtyping
between simple types and one for subtyping between normal types. Both of
these algorithms are syntax directed and obviously terminate on all inputs (all
recursive calls reduce the size of the inputs). □

3.4.3 Lemma: N ◁ N, for all N. □

3.4.4 Lemma: If N ◁ M and M ◁ L then N ◁ L. □

Proof: By induction on the total size of L, M, N. First suppose that all of
L, M, N are simple. The induction hypothesis is immediately satisfied for
SA-Base and SA-Set. For SA-Rcd use the transitivity of set inclusion and
induction on the appropriate subterms.

If at least one of L, M, N is (non-trivially) normal, use the transitivity of
the functional relationship expressed by the SN-Union rule and induction on the
appropriate subterms. □

To transfer the property of decidability from ◁ to <:, we first show how any
type may be converted to an equivalent type in disjunctive normal form.

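3.4.5 Definition: The disjunctive normal form (dnf) of a type T is defined as
follows:

\[
\begin{array}{rcll}
\mathit{dnf}(B) & = & B \\
\mathit{dnf}([\,]) & = & [\,] \\
\mathit{dnf}(P \times Q) & = & \bigvee(A_i \times B_j \mid A_i \in \mathit{dnf}(P),\; B_j \in \mathit{dnf}(Q)) & \text{(a)} \\
\mathit{dnf}([l : P]) & = & \bigvee([l : A_i] \mid A_i \in \mathit{dnf}(P)) & \text{(b)} \\
\mathit{dnf}(P \vee Q) & = & \mathit{dnf}(P) \vee \mathit{dnf}(Q) & \text{(c)} \\
\mathit{dnf}(\mathit{set}(P)) & = & \mathit{set}(\mathit{dnf}(P)) & \text{(d)}
\end{array}
\]
□

3.4.6 Fact: dnf(P) ~ P. □

3.4.7 Fact: dnf(P) is a normal type, for every type P. □

3.4.8 Fact: N ◁ M implies N <: M. □

3.4.9 Lemma: S <: T iff dnf(S) ◁ dnf(T). □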



Proof: (⇐) By 3.4.6 we have derivations of S <: dnf(S) and dnf(T) <: T, and
by 3.4.8 we have a derivation of dnf(S) <: dnf(T). Use transitivity to build a
derivation of S <: T.

(⇒) By induction on the height of the derivation of S <: T. We consider the
final rule in the derivation. By induction we assume we can build a derivation
of the normal forms for the antecedents, and now we consider all possible final
rules.

We start with the axioms.

(S-Refl) By reflexivity of ◁ (3.4.3).

(S-Rcd-FE) dnf([l : T]) = [l : dnf(T)], and [l : dnf(T)] ◁ [ ] by SA-Rcd.

(S-Rcd-RE) dnf(S × T) = ⋁(Si × Tj | Si ∈ dnf(S), Tj ∈ dnf(T)). Now
dnf(Si) × dnf(Tj) ◁ dnf(Si) by SA-Rcd, and the result follows from SN-Union.

(S-Rcd-Comm) If dnf(S) and dnf(T) are simple then dnf(S) × dnf(T) ◁ dnf(T) × dnf(S)
by SA-Rcd. If not, use SN-Union first.

(S-Rcd-Assoc) As for S-Rcd-Comm.

(S-Rcd-Ident) As for S-Rcd-Comm.

(S-Union-UB) By SN-Union.

(S-Dist-Rcd)

dnf(R × (S ∨ T)) = ⋁(Ri × Uj | Ri ∈ dnf(R), Uj ∈ dnf(S ∨ T))
                 = ⋁(Ri × Sj | Ri ∈ dnf(R), Sj ∈ dnf(S)) ∨
                   ⋁(Ri × Tk | Ri ∈ dnf(R), Tk ∈ dnf(T))
                 = dnf((R × S) ∨ (R × T))

(S-Dist-Field)

dnf([l : S ∨ T]) = ⋁([l : Ui] | Ui ∈ dnf(S ∨ T))
                 = ⋁([l : Si] | Si ∈ dnf(S)) ∨ ⋁([l : Ti] | Ti ∈ dnf(T))
                 = dnf([l : S]) ∨ dnf([l : T])
                 = dnf([l : S] ∨ [l : T])

Now for the inference rules. The premises for all the rules are of the form
S <: T and our inductive hypothesis is that for the premises of the final rule we
have obtained a derivation using SA-* and SN-Union rules of the corresponding
dnf(S) ◁ dnf(T). Without loss of generality we may assume that the final rule
in the derivation of each such premise is SN-Union. We examine the remaining
inference rules.

(S-Trans) By Lemma 3.4.4.

(S-Rcd-DF) Since dnf(S) ◁ dnf(T) was derived by SN-Union we know that
for each Ai ∈ dnf(S) there is a Bj ∈ dnf(T) such that Ai ◁ Bj. Therefore,
for each such Ai, we may use SA-Rcd to derive [l : Ai] ◁ [l : Bj]. These
derivations may be combined using SN-Union to obtain a derivation of
dnf([l : S]) ◁ dnf([l : T]).

(S-Rcd-DR) For each A¹ᵢ ∈ dnf(S1) and each A²ₖ ∈ dnf(S2) there exist B¹ⱼ ∈
dnf(T1) and B²ₗ ∈ dnf(T2) such that we have derivations of A¹ᵢ ◁ B¹ⱼ
and A²ₖ ◁ B²ₗ. For each such pair we can therefore use SA-Rcd to derive
A¹ᵢ × A²ₖ ◁ B¹ⱼ × B²ₗ and then use SN-Union to derive
dnf(S1 × S2) ◁ dnf(T1 × T2).

(S-Set) Immediate, by SA-Set.

(S-Union-L) For each Ai ∈ dnf(R) there is a Cj ∈ dnf(T) such that Ai ◁ Cj
and for each Bk ∈ dnf(S) there is a Cl ∈ dnf(T) such that Bk ◁ Cl. From
these dnf(R ∨ S) ◁ dnf(T) can be derived directly using SN-Union. □

3.4.10 Theorem: The subtype relation is decidable. □

Proof: Immediate from Lemmas 3.4.9 and 3.4.2. □
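
The decision procedure suggested by Lemma 3.4.9 (normalise both types, then apply the macro rules) can be sketched in Python. This is an illustration under our own assumptions rather than the authors' code: types are nested tuples, inputs are assumed well kinded, base types carry a name so that examples can use Int and Str (the formal system has a single base type B), and all function names are hypothetical.

```python
def dnf(t):
    """Return dnf(t) as a list of simple types (the members of the union)."""
    tag = t[0]
    if tag == "base":
        return [t]
    if tag == "empty":
        return [("rcd", ())]
    if tag == "field":                      # clause (b)
        return [("rcd", ((t[1], a),)) for a in dnf(t[2])]
    if tag == "prod":                       # clause (a)
        return [("rcd", tuple(sorted(a[1] + b[1], key=lambda f: f[0])))
                for a in dnf(t[1]) for b in dnf(t[2])]
    if tag == "union":                      # clause (c)
        return dnf(t[1]) + dnf(t[2])
    if tag == "set":                        # clause (d)
        return [("set", tuple(dnf(t[1])))]
    raise ValueError(f"unknown type constructor: {tag}")


def sub_simple(a, b):
    """Macro rules SA-Base, SA-Set and SA-Rcd on simple types."""
    if a[0] != b[0]:
        return False
    if a[0] == "base":
        return a == b
    if a[0] == "set":
        return sub_normal(a[1], b[1])
    have = dict(a[1])                       # a may have more fields than b
    return all(l in have and sub_simple(have[l], bt) for l, bt in b[1])


def sub_normal(ns, ms):
    """Macro rule SN-Union: every member of ns is below some member of ms."""
    return all(any(sub_simple(a, b) for b in ms) for a in ns)


def subtype(s, t):
    """Decide S <: T by comparing disjunctive normal forms."""
    return sub_normal(dnf(s), dnf(t))


Int, Str = ("base", "Int"), ("base", "Str")
field = lambda l, t: ("field", l, t)
prod = lambda s, t: ("prod", s, t)
union = lambda s, t: ("union", s, t)
set_of = lambda t: ("set", t)

lhs = field("l", union(Int, Str))                # [l : Int v Str]
rhs = union(field("l", Int), field("l", Str))    # [l : Int] v [l : Str]
```

With this encoding, lhs and rhs are subtypes of each other, which mirrors the observation that S-Dist-Field is an equivalence.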
We do not yet have any results on the complexity of checking subtyping (or
equivalence). (The proof strategy we have adopted here leads to an algorithm 
with running time exponential in the size of its inputs.) 
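
To make the exponential behaviour concrete, here is a small worked count. It is an illustration only, and the labels a_i and b_i are hypothetical. Consider a product of n two-way unions of single-field records:

```latex
\[
T_n = ([a_1 : B] \vee [b_1 : B]) \times \cdots \times ([a_n : B] \vee [b_n : B])
\]
\[
\mathit{dnf}(T_n) = \bigvee\bigl([c_1 : B] \times \cdots \times [c_n : B] \;\bigm|\; c_i \in \{a_i, b_i\}\bigr)
\]
```

T_n mentions only 2n fields, yet by repeated use of clause (a) of Definition 3.4.5 its normal form is a union of 2^n simple record types. By the same count, the solid-object type of Section 2.4 normalises to 1 · (2 + 2 · 2 · 2) = 10 simple record types.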

The structured form of the macro rules can be used to derive several inversion 
properties, which will be useful later in reasoning about the typing relation. 

3.4.11 Corollary: If S <: set(T1), then S = set(S1), with S1 <: T1. □

3.4.12 Corollary: If W <: U, with

W = [l1 : W1] × ... × [lm : Wm] × ... × [ln : Wn]

U = [l1 : U1] × ... × [lm : Um],

then Wk <: Uk for each k ≤ m. □

Proof: From the definition of disjunctive normal forms, we know that dnf(W) =
⋁([l1 : Wi1] × ... × [lm : Wim] × ... × [ln : Win] | Wi1 ... Wim ... Win ∈
dnf(W1) ... dnf(Wm) ... dnf(Wn)) and dnf(U) = ⋁([l1 : Uj1] × ... × [lm : Ujm] |
Uj1 ... Ujm ∈ dnf(U1) ... dnf(Um)). By SN-Union,

for each Ai = [l1 : Wi1] × ... × [lm : Wim] × ... × [ln : Win] ∈ dnf(W)
there is some Bj = [l1 : Uj1] × ... × [lm : Ujm] ∈ dnf(U)
with Ai ◁ Bj.

This derivation must be an instance of SA-Rcd, with Wik ◁ Ujk. In other
words, for each Wik ∈ dnf(Wk) there is some Ujk ∈ dnf(Uk) with Wik ◁ Ujk.
By SN-Union, dnf(Wk) ◁ dnf(Uk). The desired result, Wk <: Uk, now follows by
Lemma 3.4.9. □

3.4.13 Corollary: If S is a simple type and S <: T1 ∨ T2, then either S <: T1
or else S <: T2. □

3.5 Terms 

The sets of programs, functions, and patterns are described by the following 
grammar: 

e ::= b                            base value
      x                            variable
      [l1 = e1, ..., ln = en]      record construction
      case e of f                  pattern matching
      {e1, ..., en}                set
      e1 union e2                  union of sets
      collect e1 where p <- e2     set comprehension

p ::= x : T                        variable pattern (typecase)
      [l1 = p1, ..., ln = pn]      record pattern
      p1 # p2                      pattern concatenation
      x as f                       function nested in pattern

f ::= p => e                       base function
      f1 | f2                      compound function






3.6 Typing 



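The typing rules are quite standard.

Expressions (Γ ⊢ e ∈ T)

\[
\begin{array}{cl}
\Gamma \vdash b \in B & \text{(T-Base)} \\[1.5ex]
\Gamma \vdash x \in \Gamma(x) & \text{(T-Var)} \\[1.5ex]
\dfrac{\Gamma \vdash e_i \in T_i \qquad \text{all the } l_i \text{ are distinct}}
      {\Gamma \vdash [l_1 = e_1, \ldots, l_n = e_n] \in [l_1 : T_1] \times \cdots \times [l_n : T_n]} & \text{(T-Rcd)} \\[2.5ex]
\dfrac{\Gamma \vdash f \in S \to T \qquad \Gamma \vdash e \in R \qquad R <: S}
      {\Gamma \vdash \mathtt{case}\ e\ \mathtt{of}\ f \in T} & \text{(T-Case)} \\[2.5ex]
\dfrac{\Gamma \vdash e_i \in T_i \text{ for each } i \qquad n \geq 1}
      {\Gamma \vdash \{e_1, \ldots, e_n\} \in \mathit{set}(T_1 \vee \cdots \vee T_n)} & \text{(T-Set)} \\[2.5ex]
\dfrac{\Gamma \vdash e_1 \in \mathit{set}(T_1) \qquad \Gamma \vdash e_2 \in \mathit{set}(T_2)}
      {\Gamma \vdash e_1\ \mathtt{union}\ e_2 \in \mathit{set}(T_1 \vee T_2)} & \text{(T-Union)} \\[2.5ex]
\dfrac{\Gamma \vdash e_2 \in \mathit{set}(S) \qquad \Gamma \vdash p \in U \Rightarrow \Gamma' \qquad S <: U \qquad \Gamma, \Gamma' \vdash e_1 \in \mathit{set}(T)}
      {\Gamma \vdash \mathtt{collect}\ e_1\ \mathtt{where}\ p \leftarrow e_2 \in \mathit{set}(T)} & \text{(T-Collect)}
\end{array}
\]

Functions (Γ ⊢ f ∈ S→T)

\[
\begin{array}{cl}
\dfrac{\Gamma \vdash p \in S \Rightarrow \Gamma' \qquad \Gamma, \Gamma' \vdash e \in T}
      {\Gamma \vdash p \Rightarrow e \in S \to T} & \text{(TF-Pat)} \\[2.5ex]
\dfrac{\Gamma \vdash f_1 \in S_1 \to T_1 \qquad \Gamma \vdash f_2 \in S_2 \to T_2}
      {\Gamma \vdash f_1 \mid f_2 \in S_1 \vee S_2 \to T_1 \vee T_2} & \text{(TF-Alt)}
\end{array}
\]

Patterns (Γ ⊢ p ∈ T ⇒ Γ′)

\[
\begin{array}{cl}
\dfrac{T \in K}{\Gamma \vdash x : T \in T \Rightarrow x : T} & \text{(TP-Var)} \\[2.5ex]
\dfrac{\Gamma \vdash p_i \in T_i \Rightarrow \Gamma_i' \qquad \text{all the } l_i \text{ are distinct} \qquad \text{the } \Gamma_i' \text{ all have disjoint domains}}
      {\Gamma \vdash [l_1 = p_1, \ldots, l_n = p_n] \in [l_1 : T_1] \times \cdots \times [l_n : T_n] \Rightarrow \Gamma_1', \ldots, \Gamma_n'} & \text{(TP-Rcd)} \\[2.5ex]
\dfrac{\begin{array}{c}
         \Gamma \vdash p_1 \in [k_1 : S_1, \ldots, k_m : S_m] \Rightarrow \Gamma_1' \qquad
         \Gamma \vdash p_2 \in [l_1 : T_1, \ldots, l_n : T_n] \Rightarrow \Gamma_2' \\
         \{k_1, \ldots, k_m\} \cap \{l_1, \ldots, l_n\} = \emptyset \qquad
         \Gamma_1' \text{ and } \Gamma_2' \text{ have disjoint domains}
       \end{array}}
      {\Gamma \vdash p_1 \mathbin{\#} p_2 \in [k_1 : S_1, \ldots, k_m : S_m, l_1 : T_1, \ldots, l_n : T_n] \Rightarrow \Gamma_1', \Gamma_2'} & \text{(TP-Concat)} \\[2.5ex]
\dfrac{\Gamma \vdash f \in S \to T}{\Gamma \vdash x\ \mathtt{as}\ f \in S \Rightarrow x : T} & \text{(TP-As)}
\end{array}
\]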



3.7 Properties of Typing 

3.7.1 Proposition: The typing relation is decidable. □ 

Proof: Immediate from the decidability of subtyping and the syntax-direct- 
edness of the typing rules. □ 

3.7.2 Definition: A substitution σ is a finite function from variables to terms.
We say that a substitution σ satisfies a context Σ, written Σ ⊨ σ, if they have
the same domain and, for each x in their common domain, we have ⊢ σ(x) ∈ Sx
for some Sx with Sx <: Σ(x). □

3.7.3 Definition: We say that a typing context Γ refines another context Γ′,
written Γ <: Γ′, if their domains are the same and, for each x ∈ dom(Γ), we
have Γ(x) <: Γ′(x). □

3.7.4 Fact [Narrowing]: If Γ ⊢ e ∈ T and Γ <: Γ′, then Γ′ ⊢ e ∈ T. □

3.7.5 Lemma [Substitution preserves typing]:

1. If Δ ⊨ σ and Γ, Δ ⊢ e ∈ Q then Γ ⊢ σ(e) ∈ P, for some P <: Q.

2. If Δ ⊨ σ and Γ, Δ ⊢ f ∈ S→Q then Γ ⊢ σ(f) ∈ S→P, for some P <: Q.

3. If Δ ⊨ σ and Γ, Δ ⊢ p ∈ T ⇒ Δ′ then Γ ⊢ σ(p) ∈ T ⇒ Δ″, for some
   Δ″ <: Δ′. □

Proof: By simultaneous induction on derivations. The arguments are all
straightforward, using previously established facts. (For the third property,
note that substitution into a pattern only affects functions that may be
embedded in the pattern, since all other variables mentioned in the pattern are
binding occurrences. Moreover, by our conventions about names of bound variables,
we must assume that the variables bound in an expression, function, or pattern are
distinct from those defined by σ.) □

3.8 Evaluation 

The operational semantics of our language is again quite standard: we define a
relation e ⇓ v, read “(closed) expression e evaluates to result v,” by a collection
of syntax-directed rules embodying a simple abstract machine.

3.8.1 Definition: We will use the metavariables v and w to range over values,
that is, closed expressions not involving case, union, or collect.

v ::= b
      [l1 = v1, ..., ln = vn]
      {v1, ..., vn}

We write v̄ as shorthand for a set of values v1, ..., vn. □

3.8.2 Definition: A substitution σ is a finite function from variables to values.
When σ1 and σ2 have disjoint domains, we write σ1 + σ2 for their combination.
□



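Reduction (e ⇓ v, for closed terms e)

\[
\begin{array}{cl}
b \Downarrow b & \text{(E-Base)} \\[1.5ex]
\dfrac{e_i \Downarrow v_i \text{ for each } i}
      {[l_1 = e_1, \ldots, l_n = e_n] \Downarrow [l_1 = v_1, \ldots, l_n = v_n]} & \text{(E-Rcd)} \\[2.5ex]
\dfrac{e \Downarrow v \qquad \mathit{match}(v, f) \Rightarrow v'}
      {\mathtt{case}\ e\ \mathtt{of}\ f \Downarrow v'} & \text{(E-Case)} \\[2.5ex]
\dfrac{e_i \Downarrow v_i \text{ for each } i}
      {\{e_1, \ldots, e_n\} \Downarrow \{v_1, \ldots, v_n\}} & \text{(E-Set)} \\[2.5ex]
\dfrac{e_1 \Downarrow \{\bar{v}_1\} \qquad e_2 \Downarrow \{\bar{v}_2\}}
      {e_1\ \mathtt{union}\ e_2 \Downarrow \{\bar{v}_1 \cup \bar{v}_2\}} & \text{(E-Union)} \\[2.5ex]
\dfrac{e_2 \Downarrow \{v_1, \ldots, v_n\} \qquad \text{for each } i,\; \mathit{match}(v_i, p) \Rightarrow \sigma_i \text{ and } \sigma_i(e_1) \Downarrow \{\bar{w}_i\}}
      {\mathtt{collect}\ e_1\ \mathtt{where}\ p \leftarrow e_2 \Downarrow \{\bar{w}_1 \cup \cdots \cup \bar{w}_n\}} & \text{(E-Collect)}
\end{array}
\]

Function matching (match(v, f) ⇒ v′)

\[
\begin{array}{cl}
\dfrac{\mathit{match}(v, p) \Rightarrow \sigma \qquad \sigma(e) \Downarrow v'}
      {\mathit{match}(v, p \Rightarrow e) \Rightarrow v'} & \text{(EF-Pat)} \\[2.5ex]
\dfrac{\mathit{match}(v, f_1) \Rightarrow v'}
      {\mathit{match}(v, f_1 \mid f_2) \Rightarrow v'} & \text{(EF-Alt1)} \\[2.5ex]
\dfrac{\neg(\mathit{match}(v, f_1)) \qquad \mathit{match}(v, f_2) \Rightarrow v'}
      {\mathit{match}(v, f_1 \mid f_2) \Rightarrow v'} & \text{(EF-Alt2)}
\end{array}
\]

Matching (match(v, p) ⇒ σ)

\[
\begin{array}{cl}
\dfrac{\vdash v \in S \qquad S <: T}
      {\mathit{match}(v, x : T) \Rightarrow x = v} & \text{(EP-Var)} \\[2.5ex]
\dfrac{\mathit{match}(v_i, p_i) \Rightarrow \sigma_i \qquad \text{the } \sigma_i \text{ have disjoint domains}}
      {\mathit{match}([l_1 = v_1, \ldots, l_m = v_m, \ldots, l_n = v_n],\; [l_1 = p_1, \ldots, l_m = p_m]) \Rightarrow \sigma_1 + \cdots + \sigma_m} & \text{(EP-Rcd)} \\[2.5ex]
\dfrac{\mathit{match}(v, p_1) \Rightarrow \sigma_1 \qquad \mathit{match}(v, p_2) \Rightarrow \sigma_2 \qquad \sigma_1 \text{ and } \sigma_2 \text{ have disjoint domains}}
      {\mathit{match}(v, p_1 \mathbin{\#} p_2) \Rightarrow \sigma_1 + \sigma_2} & \text{(EP-Concat)} \\[2.5ex]
\dfrac{\mathit{match}(v, f) \Rightarrow v'}
      {\mathit{match}(v, x\ \mathtt{as}\ f) \Rightarrow x = v'} & \text{(EP-As)}
\end{array}
\]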



3.9 Properties of Evaluation 

3.9.1 Fact: If v is a value and ⊢ v ∈ V, then V is a simple type. □

3.9.2 Theorem [Subject reduction]:

1. If e ⇓ v
      ⊢ e ∈ Q,
   then ⊢ v ∈ V
      V <: Q.

2. If match(v, f) ⇒ v′
      ⊢ f ∈ U→V
      ⊢ v ∈ W
      W <: U,
   then ⊢ v′ ∈ X
      X <: V.

3. If match(v, p) ⇒ σ
      ⊢ v ∈ W
      ⊢ p ∈ U ⇒ Σ
      W <: U,
   then Σ ⊨ σ. □

Proof: By simultaneous induction on evaluation derivations.

1. Straightforward, using part (2) of the induction hypothesis for the E-Case
   case and part (3) for E-Collect.

2. Consider the final rule in the given derivation.

   Case EF-Pat: f = p ⇒ e
                match(v, p) ⇒ σ
                σ(e) ⇓ v′

   From ⊢ f ∈ U→V, we know ⊢ p ∈ U ⇒ Σ and Σ ⊢ e ∈ V. By part (3) of
   the induction hypothesis, Σ ⊨ σ. By Lemma 3.7.5, ⊢ σ(e) ∈ V. Now, by the
   induction hypothesis, ⊢ v′ ∈ X and X <: V, as required.

   Case EF-Alt1: f = f1 | f2
                 match(v, f1) ⇒ v′

   From rule TF-Alt, we see that ⊢ f1 ∈ U1→V1 and ⊢ f2 ∈ U2→V2, with
   U = U1 ∨ U2 and V = V1 ∨ V2. The induction hypothesis yields ⊢ v′ ∈ X
   with X <: V1, from which the result follows immediately by S-Union.

   Case EF-Alt2: f = f1 | f2
                 ¬(match(v, f1))
                 match(v, f2) ⇒ v′

   Similar.

3. Consider the final rule in the given derivation.

   Case EP-Var: p = (x : T)
                ⊢ v ∈ S
                S <: T
                σ = (x = v)
                Σ = (x : T)

   Immediate.

   Case EP-Rcd: v = [l1 = v1, ..., lm = vm, ..., ln = vn]
                p = [l1 = p1, ..., lm = pm]
                match(vi, pi) ⇒ σi
                σ = σ1 + ... + σm

   From T-Rcd, we have W = [l1 : W1] × ... × [ln : Wn] and ⊢ vi ∈ Wi for
   each i. Similarly, by TP-Rcd, we have U = [l1 : U1] × ... × [lm : Um]
   with ⊢ pi ∈ Ui ⇒ Σi. Finally, by Corollary 3.4.12, we see that Wi <: Ui.
   Now, by the induction hypothesis, Σi ⊨ σi for each i. But this means that
   Σ ⊨ σ, as required.

   Case EP-Concat: p = p1 # p2
                   match(v, p1) ⇒ σ1
                   match(v, p2) ⇒ σ2
                   σ1 and σ2 have disjoint domains
                   σ = σ1 + σ2

   By TP-Concat, we have U = [k1 : S1, ..., km : Sm, l1 : T1, ..., ln : Tn]
   with ⊢ p1 ∈ [k1 : S1, ..., km : Sm] ⇒ Σ1 and ⊢ p2 ∈ [l1 : T1, ..., ln : Tn] ⇒
   Σ2. Since U <: [k1 : S1, ..., km : Sm] and U <: [l1 : T1, ..., ln : Tn] by the
   subtyping laws, transitivity of subtyping gives us W <: [k1 : S1, ..., km : Sm]
   and W <: [l1 : T1, ..., ln : Tn]. Now, by the induction hypothesis, Σ1 ⊨ σ1
   and Σ2 ⊨ σ2. But this means that Σ ⊨ σ, as required.

   Case EP-As: p = x as f
               match(v, f) ⇒ v′
               σ = (x = v′)

   By TP-As, ⊢ f ∈ U→V and Σ = (x : V). By part (2) of the induction
   hypothesis, ⊢ v′ ∈ X for some X <: V. So (x : V) ⊨ (x = v′) by the
   definition of satisfaction. □

3.9.3 Theorem [Safety]:

1. If ⊢ e ∈ T, then e ⇓ v for some v. (That is, the evaluation of a closed,
   well-typed expression cannot lead to a match-failure or otherwise “get stuck.”)

2. If ⊢ f ∈ S→T and ⊢ v ∈ R <: S, then match(v, f) ⇒ v′ with ⊢ v′ ∈ T′ <: T.

3. If ⊢ p ∈ U ⇒ Γ′ and ⊢ v ∈ S <: U, then match(v, p) ⇒ σ with Γ′ ⊨ σ. □

Proof: Straightforward induction on derivations. □

4 Conclusions 

We have described a type system that may be of use in checking programs 
or queries that apply to semistructured data. Unlike other approaches to the 
problem, it is a “relaxed” version of a conventional system that can handle the 
kinds of irregular types that occur in semistructured data.

Although we have established the basic properties of the type system, a good 
deal of work remains to be done. First, there are some extensions that we do not 
see as problematic. These include: 

— Both strict and relaxed set-union operations. (In the former case the two 
types are constrained to be equivalent.) Similarly, one can imagine strict 
and relaxed case expressions. 

— Equality. Both “absolute” equality and “equality at type T” fit with this 
scheme. 

— A ⊥ (“bottom”) type: the null-ary case of union types. An immediate
application is in the typing rule T-Set for set formation, where we can
remove the side condition n ≥ 1 to allow formation of the empty set:
{ } ∈ set(⊥).

— Additional base types such as booleans and operations such as set filtering. 

— A ⊤ (“top”) type. Such a type would be completely dynamic and would
be analyzed by typecase expressions. One could also add type inspection 
primitives along the lines described for Amber [Carrk] 

— An “otherwise” or “fall-through” branch in case expressions. 

A number of more significant problems also remain to be addressed. 

— Complexity. The obvious method of checking whether two types are equiv- 
alent or whether one is a subtype of the other involves first reducing both to 
disjunctive normal form. As we have observed, this process may be exponen- 
tial in the size of the two type expressions. We conjecture that equivalence 
(and subtyping) can be checked faster, but we have not been able to show 
this. 

Even if these problems turn out to be intractable in general, it does not
necessarily mean that this approach to typing semistructured data is pointless.
Type inference in ML, for example, is known to be exponential [KTU94], yet
the forms of ML programs that are the cause of this complexity never occur
in practice. Here, it may be the case that types that only have
“small” differences will not give rise to expensive transformations.

— Recursive types. The proof of the decidability of subtyping (3.4.10) works
by induction on the derivation tree of a type, which is closely related to the
structure of the type. We do not know whether the same result holds in the
presence of recursive types.

— Relationship with other typing schemes. There may be some relation- 
ship between the typing scheme proposed here and those mentioned ear- 
lier [NAM97,Ali99] that work by inferring structure from semi-structured 
data. Simulation, for example, gives rise to something like a subtyping rela- 
tionship [BDFS97]; but it is not clear what would give rise to union types. 

— Applications. Finally, we would like to think that a system like this could 
be of practical benefit. We mentioned that there is a group of biological 
data formats that are all derived from a common basic format. We should 
also mention that the pattern matching constructs introduced in section 2.2, 
independently of any typing issues, might be used to augment other query 
languages such as XML-QL [DFF+] that exploit pattern matching. 
Abstract. Path constraints have been studied for semistructured data
modeled as a rooted edge-labeled directed graph [4, 11-13]. In this model,
the implication problems associated with many natural path constraints
are undecidable [11, 13]. A variant of the graph model, called the
deterministic data model, was recently proposed in [10]. In this model, data is
represented as a graph with deterministic edge relations, i.e., the edges
emanating from any node in the graph have distinct labels. This model is
more appropriate for representing, e.g., ACeDB [27] databases and Web
sites. This paper investigates path constraints for the deterministic data
model. It demonstrates the application of path constraints to, among
others, query optimization. Three classes of path constraints are
considered: the language P_c introduced in [11], an extension of P_c, denoted by
P_c^w, by including wildcards in path expressions, and a generalization of
P_c, denoted by P_c^*, by representing paths as regular expressions. The
implication problems for these constraint languages are studied in the
context of the deterministic data model. It is shown that in contrast to
the undecidability result of [11], the implication and finite implication
problems for P_c are decidable in cubic-time and are finitely
axiomatizable. Moreover, the implication problems are decidable for P_c^w. However,
the implication problems for P_c^* are undecidable.



1 Introduction 

Semistructured data is usually modeled as an edge-labeled rooted directed graph 
[1,8]. Let us refer to this graph model as the semistructured data model (SM). 
For data found in many applications, the graph is deterministic, i.e., the edges 
emanating from each node in the graph have distinct labels. For example, when 
modeling Web pages as a graph, a node stands for an HTML document and an 
edge represents a link with an HTML label from one document (source) to
another (target). It is reasonable to assume that the HTML label uniquely identifies
the target document. Even if this is not literally the case, one can achieve this
by including the URL (Uniform Resource Locator) of the target document in
the edge label. This yields a deterministic graph. As another example, consider 
ACeDB [27], which is a database management system popular with biologists. 
A graph representing an ACeDB database is also deterministic. In general, any 
database with “exportable” data identities can be modeled as a deterministic 
graph by including the identities in the edge labels. Here by exportable iden- 
tities we mean directly observable identities such as keys. Some relational and 
object-oriented database management systems support exportable identities. In 
the OEM model (see, e.g., [3]), there are also exportable object identities. To cap- 
ture this, we consider a variant of SM , referred to as the deterministic data model 
(DM). In DM, data is represented as a deterministic, rooted, edge-labeled, di- 
rected graph. An important feature of DM is that in this model, each component 
of a database can be uniquely identified by a path. 

A number of query languages (e.g., [3,9,24]) have been developed for
semistructured data. The study of semistructured data has also led to the design
of query languages (e.g., [17]) for XML (eXtensible Markup Language [7]) data.
In these languages, queries are described in terms of navigation paths. To opti- 
mize path queries, it often appears necessary to use structural information about 
the data described by path constraints. Path constraints are capable of express- 
ing natural integrity constraints that are a fundamental part of the semantics of 
the data, such as inclusion dependencies and inverse relationships. In traditional 
structured databases such as object-oriented databases, this semantic informa- 
tion is described in schemas. Unlike structured databases, semistructured data 
does not have a schema, and path constraints are used to convey the semantics 
of the data. The approach to querying semistructured data with path constraints 
was proposed in [4] and later studied in [11-13]. Several proposals (e.g., [6, 19, 
21,22]) for adding structure or type systems to XML data also advocate the 
need for integrity constraints that can be expressed as path constraints. 

To use path constraints in query optimization, it is important to be able to 
reason about them. That is, we need to settle the question of constraint im- 
plication: given that certain constraints are known to hold, does it follow that 
some other constraint is necessarily satisfied? In the database context, only fi- 
nite instances (graphs) are considered, and implication is referred to as finite 
implication. In the traditional logic framework, both infinite and finite instances 
(graphs) are permitted, and constraint implication is called (unrestricted) im- 
plication. For the model SM, it has been shown that the implication and finite 
implication problems associated with many natural constraints are undecidable. 
For example, these problems are undecidable for the simple constraint language 
Pc introduced in [11-13]. In addition, we have already studied the connection 
between object-oriented databases and semistructured databases in SM with Pc 
constraints [12]. The results of [12] show that the connection is not simple. 

In this paper, we investigate path constraints for DM. We demonstrate
applications of path constraints to semantic specification and query optimization,
and study the implication problems associated with path constraints. We show
that, in contrast to the undecidability result of [11, 13], the implication and finite
implication problems for $P_c$ are decidable in cubic time and are finitely
axiomatizable in the context of DM. That is, there is a finite set of inference rules that is
sound and complete for implication and finite implication of $P_c$ constraints, and
in addition, there is an algorithm for testing $P_c$ constraint implication in time
$O(n^3)$, where n is the length of constraints. This demonstrates that the
determinism condition of DM simplifies the analysis of path constraint implication.
We also introduce and investigate two generalizations of $P_c$. One generalization,
denoted by $P_c^w$, is defined by including wildcards in path expressions. The
other, denoted by $P_c^*$, represents paths by regular expressions. We show that in
the context of DM, the implication and finite implication problems for $P_c^w$ are
also decidable. However, these problems are undecidable for $P_c^*$ in the context
of DM. This undecidability result shows that the determinism condition of DM
does not reduce the analysis of path constraint implication to a trivial problem.

The rest of the paper is organized as follows. Section 2 uses an example to
illustrate how path constraints can be used in query optimization. Section 3
reviews the definition of $P_c$ constraints proposed in [11], and introduces two
extensions of $P_c$, namely, $P_c^w$ and $P_c^*$. Section 4 studies the implication and
finite implication problems for $P_c$, $P_c^w$ and $P_c^*$ for the deterministic data model.
Finally, Section 5 identifies open problems and directions for further work.

2 An example 

To demonstrate applications of path constraints, let us consider Fig. 1, which col- 
lects information on employees and departments. It is an example of semistruc- 
tured data represented in the deterministic data model DM. In Fig. 1, there 
are two edges emanating from root r, labeled emp and dept and connected to 
nodes Emp and Dept, respectively. Edges emanating from Emp are labeled em- 
ployee ID’s and connected to nodes representing employees. An employee node 
may have three edges emanating from it: an edge labeled manager and connected 
to his/her manager, an edge labeled supervising that connects to a node from 
which there are outgoing edges to employees under his/her supervision, and an 
edge labeled name. Similarly, there are vertices representing departments that 
may have edges connected to employees. Observe that Fig. 1 is deterministic. 
Path constraints. Typical path constraints on Fig. 1 include:

$$\forall x\,(\mathit{emp} \cdot \_ \cdot \mathit{manager}(r, x) \rightarrow \mathit{emp} \cdot \_(r, x)) \qquad (\phi_1)$$

$$\forall x\,(\mathit{emp} \cdot \_ \cdot \mathit{supervising} \cdot \_(r, x) \rightarrow \mathit{emp} \cdot \_(r, x)) \qquad (\phi_2)$$

$$\forall x\,(\mathit{emp} \cdot \_(r, x) \rightarrow \forall y\,(\mathit{manager}(x, y) \rightarrow \mathit{supervising} \cdot \_(y, x))) \qquad (\phi_3)$$

Here r denotes the root of the graph, variables x and y range over vertices,
and “_” is a “wildcard” symbol, which matches any edge label. A path in the
graph is a sequence of edge labels, including wildcards. Path constraints describe
inclusion relationships. For example, $\phi_1$ states that if a node is reached from the
root r by following $\mathit{emp} \cdot \_ \cdot \mathit{manager}$, then it is also reachable from r by following
$\mathit{emp} \cdot \_$. It asserts that the manager of any employee is also an employee that
occurs in the database. The constraints given above are in the language $P_c^w$.
Fig. 1. An example semistructured database in DM

We generalize $P_c^w$ by representing paths as regular expressions. This
generalization is denoted by $P_c^*$. For example, the following are constraints of $P_c^*$:

$$\forall x\,(\mathit{emp} \cdot \_(r, x) \rightarrow \forall y\,(\mathit{manager} \cdot \mathit{manager}^*(x, y) \rightarrow \mathit{supervising} \cdot \_(y, x))) \qquad (\psi_1)$$

$$\forall x\,(\mathit{emp} \cdot \_(r, x) \rightarrow \forall y\,(\mathit{supervising} \cdot \_(x, y) \rightarrow \mathit{manager} \cdot \mathit{manager}^*(y, x))) \qquad (\psi_2)$$

Here * is the Kleene star. These constraints describe an inverse relationship
between $\mathit{manager} \cdot \mathit{manager}^*$ and $\mathit{supervising} \cdot \_$. For example, $\psi_1$ asserts that
for any employee x and for any y, if y is reachable from x by following one or more
manager edges, then x is reachable from y by following path $\mathit{supervising} \cdot \_$.

A subclass of $P_c^*$, $P_c$, has been investigated in [11-13] for the graph model
SM for semistructured data. As opposed to $P_c^*$ constraints, path constraints of
$P_c$ contain neither wildcards nor the Kleene star. In DM, $P_c$ constraints express
path equalities. For example, the following can be described by $P_c$ constraints:

$$\mathit{emp} \cdot \mathit{e1} \cdot \mathit{manager} = \mathit{emp} \cdot \mathit{e2} \qquad (\varphi_1)$$

$$\mathit{dept} \cdot \mathit{d1} \cdot \mathit{emp} \cdot \mathit{e3} = \mathit{emp} \cdot \mathit{e3} \qquad (\varphi_2)$$

Semantic specification with path constraints. The path constraints above
describe certain typing information about the data. For example, abusing
object-oriented database terms, $\phi_1$ asserts that a manager of an employee has an
“employee type”, and in addition, is in the “extent” of “class” employee. By
using $\phi_1$, it can be shown that for any employee x and any y, if y is reachable from
x by following zero or more manager edges, then y also has an “employee type”
and is in the “extent” of employee. A preliminary type system was proposed in
[10] for the deterministic data model, in which the types of paths are defined
by means of path constraints. This is a step towards unifying the (programming
language) notion of a type with the (database) notion of a schema.

Query optimization with path constraints. To illustrate how path
constraints can be used in query optimization, consider again the database shown
in Fig. 1. Suppose, for example, we want to find the name of the employee with
ID e3 in department d1. One may write the query as Q1 (in Lorel syntax [3]):

    Q1:  select X.name
         from r.dept.d1.emp.e3 X

Given path constraint $\varphi_2$, the query Q1 can be rewritten as Q1':

    Q1': select X.name
         from r.emp.e3 X

One can easily verify that Q1 and Q1' are equivalent.

As another example, suppose we want to find the names of the employees
connected to Smith by one or more manager edges. Without path constraints,
one would write the query as Q2 (in Lorel syntax):

    Q2:  select X.name
         from r.emp.% X, X(.manager)+ Y
         where Y.name = "Smith"

In Lorel, % denotes wildcard and (.manager)+ means one or more occurrences
of manager edges. Given constraints $\psi_1$, $\psi_2$, $\phi_1$ and $\phi_2$, we can rewrite Q2 as
Q2', which finds the names of the employees under the supervision of Smith:

    Q2': select X.name
         from r.emp.% Y, Y.supervising.% X
         where Y.name = "Smith"

It can be verified that given those path constraints, Q2 and Q2' are equivalent. In
addition, Q2' is more efficient than Q2 because it does not require the traversal
of sequences of manager edges. It should be mentioned that to show Q2 and Q2'
are equivalent, we need to verify that certain constraints necessarily hold given
that $\psi_1$, $\psi_2$, $\phi_1$ and $\phi_2$ hold. That is, they are implied by $\psi_1$, $\psi_2$, $\phi_1$ and $\phi_2$. In
particular, we need to show that $\psi_3$ below is implied by $\psi_1$, $\psi_2$, $\phi_1$ and $\phi_2$:

$$\forall x\,(\mathit{emp} \cdot \_ \cdot \mathit{manager}^*(r, x) \rightarrow \mathit{emp} \cdot \_(r, x)) \qquad (\psi_3)$$
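
The equivalence of the two Smith queries can be checked by hand on a small instance. The following Python sketch is not Lorel: it hand-codes both queries over a hypothetical three-employee database in which supervision is the inverse of one-or-more manager edges. The node names (n1, n2, n3, S1, S2) and the employee names Jones and Lee are invented for illustration.

```python
# Deterministic graph as {node: {label: target}}; all data is hypothetical.
G = {
    "r":   {"emp": "Emp"},
    "Emp": {"e1": "n1", "e2": "n2", "e3": "n3"},
    "n1":  {"name": "Smith", "supervising": "S1"},
    "n2":  {"name": "Jones", "manager": "n1", "supervising": "S2"},
    "n3":  {"name": "Lee", "manager": "n2"},
    "S1":  {"e2": "n2", "e3": "n3"},   # Smith supervises Jones and Lee
    "S2":  {"e3": "n3"},               # Jones supervises Lee
}

def employees(graph, root):
    """Nodes reached by emp followed by any single label."""
    return graph[graph[root]["emp"]].values()

def q2(graph, root, boss):
    """Original query: walk one or more manager edges upwards from X."""
    names = set()
    for x in employees(graph, root):
        seen = set()
        y = graph[x].get("manager")
        while y is not None and y not in seen:  # `seen` guards against cycles
            if graph[y].get("name") == boss:
                names.add(graph[x]["name"])
                break
            seen.add(y)
            y = graph[y].get("manager")
    return names

def q2_rewritten(graph, root, boss):
    """Rewritten query: one supervising step plus one wildcard step from Y."""
    names = set()
    for y in employees(graph, root):
        if graph[y].get("name") != boss:
            continue
        sup = graph[y].get("supervising")
        for x in graph.get(sup, {}).values():
            names.add(graph[x]["name"])
    return names

assert q2(G, "r", "Smith") == q2_rewritten(G, "r", "Smith") == {"Jones", "Lee"}
```

The rewritten version touches only the nodes directly below Smith's supervising node, whereas the original version walks a manager chain for every employee. The two agree here only because the instance satisfies the constraints; on a graph that violates them the results may differ.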

Related work. Path constraints have been studied in [4, 11-13] for the model
SM. The constraints of [4] have either the form $p \subseteq q$ or $p = q$, where p and
q are regular expressions representing paths. The decidability of the implication
problems for this form of constraints was established in [4] for SM. Another path
constraint language, $P_c$, was introduced in [11]. It was shown there that despite
the simple syntax of $P_c$, its associated implication and finite implication problems
are undecidable in the context of SM. Detailed proofs of these undecidability
results can be found in [13]. However, these papers have considered neither the
deterministic data model DM nor the path constraint languages $P_c^w$ and $P_c^*$.

Recently, the application of integrity constraints to query optimization was 
also studied in [25]. Among other things, [25] developed an equational theory for 
query rewriting by using a certain form of constraints. 

The connection between object-oriented (OO) databases and semistructured
databases in SM with $P_c$ constraints has been studied in [12]. OO databases
are constrained by types, e.g., class types with single-valued and set-valued
attributes, whereas databases in SM are in general free of these type constraints.
These types cannot be expressed as path constraints and vice versa. As an
example, it has been shown in [12] that there is a $P_c$ constraint implication problem
that is decidable in PTIME in the context of SM, but that becomes undecidable
when an OO type system is added. On the other hand, there is a $P_c$ constraint
implication problem that is undecidable in the context of SM, but becomes
decidable in PTIME when an OO type system is imposed.

There is a natural analogy between the work on path constraints and in- 
clusion dependency theory developed for relational databases. Path constraints 
specify inclusions among certain sets of objects, and can be viewed as a gener- 
alization of inclusion dependencies. They are important in a variety of database 
contexts beyond relational databases, ranging from semistructured data to 00 
databases. It should be mentioned that the path constraints considered in this 
paper are not expressible in any class of dependencies studied for relational 
databases, including inclusion and tuple-generating dependencies [5]. 

The results established on path constraint implication in this paper may find 
applications to other fields. Indeed, if we view vertices in a graph as states and 
labeled edges as actions, then the deterministic graphs considered here are in 
fact Kripke models studied in deterministic propositional dynamic logic (DPDL;
see, e.g., [20,28]), which is a powerful language for reasoning about programs.
These deterministic graphs may also be viewed as feature structures studied in 
feature logics [26]. It should be mentioned that DPDL and feature logics are 
modal logics, in which our path constraints are not expressible. 

Description logics (see, e.g., [16]) reason about concept subsumption, which 
can express inclusion assertions similar to path constraints. There has been work 
on specifying constraints on semistructured data by means of description logics 
[15]. One of the most expressive description logics used in the database context 
is $\mathcal{ALCQI}_{reg}$ [16]. It is known that $\mathcal{ALCQI}_{reg}$ corresponds to propositional
dynamic logic (PDL) with converse and graded modalities [16,20]. We should
remark here that our path constraints are not expressible in $\mathcal{ALCQI}_{reg}$.

3 Deterministic graphs and path constraints 

In this section, we first give an abstraction of semistructured databases in DM,
and then present three path constraint languages: $P_c$, $P_c^w$ and $P_c^*$.

3.1 The deterministic data model 

In the graph model SM, a database is represented as an edge-labeled rooted
directed graph [1,8]. An abstraction of databases in SM has been given in [11]
as (finite) first-order logic structures of a relational signature

$$\sigma = (r, E),$$

where r is a constant denoting the root and E is a set of binary relation symbols
denoting the edge labels.

In the deterministic data model DM, a database is represented as an
edge-labeled rooted directed graph satisfying the determinism condition. That is, for
any edge label K and node a in the graph, there is at most one edge labeled
K going out of a. Along the same lines as the abstraction of databases in SM,
we model a database in DM as a (finite) $\sigma$-structure satisfying the determinism
condition. Such structures are called deterministic structures. A deterministic
structure G is specified by $(|G|, r^G, E^G)$, where $|G|$ is the set of nodes in G, $r^G$
is the root node, and $E^G$ is the set of binary relations on $|G|$.
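
As a concrete reading of this definition, a deterministic structure can be stored as a nested mapping from nodes to labels to targets. The Python sketch below is one possible encoding, not one prescribed by the paper; the function name `build_deterministic` and the sample node names are assumptions made for illustration. It builds the mapping from edge triples and rejects input that violates the determinism condition.

```python
from collections import defaultdict

def build_deterministic(root, edges):
    """Build {node: {label: target}} from (source, label, target) triples.

    Raises ValueError if some node has two outgoing edges with the same
    label but different targets, i.e. if the determinism condition fails.
    """
    graph = defaultdict(dict)
    graph[root]  # ensure the root is a node even if it has no edges
    for src, label, dst in edges:
        if label in graph[src] and graph[src][label] != dst:
            raise ValueError(f"node {src!r} has two {label!r} edges")
        graph[src][label] = dst
        graph[dst]  # register the target as a node
    return dict(graph)

# Hypothetical fragment: root r, an Emp node and one employee node n1.
g = build_deterministic("r", [("r", "emp", "Emp"), ("Emp", "e1", "n1")])
```

Because each inner dictionary maps a label to a single target, any structure stored this way is deterministic by construction; the explicit check matters only while loading raw edge lists.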

3.2 Path constraint language $P_c$

Next, we review the definition of $P_c$ constraints introduced in [11]. To do this,
we first present the notion of paths.

A path is a sequence of edge labels. Formally, paths are defined by the syntax:

$$\rho ::= \epsilon \mid K \mid K \cdot \rho$$

Here $\epsilon$ is the empty path, $K \in E$, and $\cdot$ denotes path concatenation. For example,
$\mathit{emp} \cdot \mathit{e1} \cdot \mathit{manager}$ and $\mathit{dept} \cdot \mathit{d1} \cdot \mathit{emp} \cdot \mathit{e3}$ are paths given in Sect. 2.

A path can be expressed as a first-order logic formula $\rho(x, y)$ with two free
variables x and y, which denote the tail and head nodes of the path, respectively.
For example, the paths given above can be described by:

$$\exists z\,(\mathit{emp}(x, z) \wedge \exists w\,(\mathit{e1}(z, w) \wedge \mathit{manager}(w, y)))$$

$$\exists z\,(\mathit{dept}(x, z) \wedge \exists w\,(\mathit{d1}(z, w) \wedge \exists u\,(\mathit{emp}(w, u) \wedge \mathit{e3}(u, y))))$$

We write $\rho(x, y)$ as $\rho$ when the parameters x and y are clear from the context.

By treating paths as logic formulas, we are able to borrow the standard notion
of models from first-order logic [18]. Let G be a deterministic structure, $\rho(x, y)$
be a path formula and a, b be nodes in $|G|$. We use $G \models \rho(a, b)$ to denote that
$\rho(a, b)$ holds in G, i.e., there is a path $\rho$ from a to b in G.

A path $\rho$ is said to be a prefix of $\varrho$ if there exists $\gamma$ such that $\varrho = \rho \cdot \gamma$.

The length of path $\rho$, $|\rho|$, is defined as follows: $|\rho| = 0$ if $\rho = \epsilon$; $|\rho| = 1 + |\varrho|$ if
$\rho$ can be written as $K \cdot \varrho$, where $K \in E$. For example, $|\mathit{emp} \cdot \mathit{e1} \cdot \mathit{manager}| = 3$.

By a straightforward induction on the lengths of paths, it can be verified that
in DM, any component of a database can be uniquely identified by a path.

Lemma 3.1: Let G be a deterministic structure. Then for any path $\rho$ and node
$a \in |G|$, there is at most one node b such that $G \models \rho(a, b)$. ■
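
Lemma 3.1 is what makes path evaluation in DM a plain lookup rather than a search. A minimal Python sketch, assuming the graph is stored as a nested dictionary `{node: {label: target}}` (an encoding chosen here for illustration) and using hypothetical node names:

```python
def follow(graph, node, path):
    """Return the unique node reached from `node` along `path`, or None.

    `path` is a sequence of edge labels; the empty path returns `node`.
    At most one result exists because each label leaves a node at most once.
    """
    for label in path:
        node = graph.get(node, {}).get(label)
        if node is None:
            return None
    return node

# Hypothetical instance: employee e1 (node n1) is managed by e2 (node n2).
G = {
    "r":   {"emp": "Emp", "dept": "Dept"},
    "Emp": {"e1": "n1", "e2": "n2"},
    "n1":  {"manager": "n2"},
    "n2":  {},
    "Dept": {},
}

assert follow(G, "r", ["emp", "e1", "manager"]) == "n2"
assert follow(G, "r", []) == "r"
assert follow(G, "r", ["emp", "e9"]) is None
```

The running time is linear in the length of the path, independent of the size of the graph, which is the practical content of "each component can be uniquely identified by a path".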

The path constraint language $P_c$ introduced in [11] is defined in terms of
path formulas. A path constraint $\varphi$ of $P_c$ is an expression of either

— the forward form: $\forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(x, y)))$, or

— the backward form: $\forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(y, x)))$.

Here $\alpha, \beta, \gamma$ are path formulas. Path $\alpha$ is called the prefix of $\varphi$, denoted by $pf(\varphi)$.
Paths $\beta$ and $\gamma$ are denoted by $lt(\varphi)$ and $rt(\varphi)$, respectively.
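
On a deterministic structure, both forms can be checked by following three paths. The sketch below assumes a nested-dictionary encoding `{node: {label: target}}`; the function names and the sample data are hypothetical and only illustrate the definitions above.

```python
def follow(graph, node, path):
    """Unique node reached from `node` along `path`, or None if none exists."""
    for label in path:
        node = graph.get(node, {}).get(label)
        if node is None:
            return None
    return node

def holds(graph, root, alpha, beta, gamma, backward=False):
    """Check one constraint with prefix alpha, left path beta, right path gamma.

    The constraint holds vacuously when alpha or beta cannot be followed.
    """
    x = follow(graph, root, alpha)
    if x is None:
        return True
    y = follow(graph, x, beta)
    if y is None:
        return True
    if backward:
        return follow(graph, y, gamma) == x   # gamma leads from y back to x
    return follow(graph, x, gamma) == y       # gamma leads from x to y

# Hypothetical instance in the spirit of Fig. 1.
G = {
    "r":    {"emp": "Emp", "dept": "Dept"},
    "Emp":  {"e1": "n1", "e2": "n2", "e3": "n3"},
    "Dept": {"d1": "D1"},
    "D1":   {"emp": "DE1"},
    "DE1":  {"e3": "n3"},
    "n1":   {"manager": "n2"},
    "n2":   {"supervising": "S2"},
    "S2":   {"e1": "n1"},
    "n3":   {},
}

# Forward, empty prefix: dept.d1.emp.e3 and emp.e3 reach the same node.
assert holds(G, "r", [], ["dept", "d1", "emp", "e3"], ["emp", "e3"])
# Backward: the manager of e1 supervises e1.
assert holds(G, "r", ["emp", "e1"], ["manager"], ["supervising", "e1"], backward=True)
assert not holds(G, "r", ["emp", "e1"], ["manager"], ["supervising", "e3"], backward=True)
```

Because each path identifies at most one node, the universally quantified variables x and y each range over at most one candidate, so no iteration over the node set is needed.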

For example, $\varphi_1$ and $\varphi_2$ given in Sect. 2 can be described by $P_c$ constraints.
A proper subclass of $P_c$ was introduced and studied in [4]. A word constraint
has the form $\forall x\,(\beta(r, x) \rightarrow \gamma(r, x))$, where $\beta$ and $\gamma$ are path formulas. In other
words, a word constraint is a forward $P_c$ constraint with its prefix being the
empty path. It has been shown in [11] that many $P_c$ constraints cannot be
expressed as word constraints or even by the more general constraints of [4].
Next, we describe implication and finite implication of $P_c$ constraints in the
context of the deterministic data model. We assume the standard notion of
models from first-order logic [18]. Let $\Sigma \cup \{\varphi\}$ be a finite subset of $P_c$. We use
$\Sigma \models \varphi$ to denote that $\Sigma$ implies $\varphi$ in the context of DM. That is, for any
deterministic structure G, if $G \models \Sigma$, then $G \models \varphi$. Similarly, we use $\Sigma \models_f \varphi$ to
denote that $\Sigma$ finitely implies $\varphi$. That is, for any finite deterministic structure
G, if $G \models \Sigma$, then $G \models \varphi$.

In the context of DM, the implication problem for $P_c$ is the problem of
determining, given any finite subset $\Sigma \cup \{\varphi\}$ of $P_c$, whether $\Sigma \models \varphi$. Similarly, the
finite implication problem for $P_c$ is the problem of determining whether $\Sigma \models_f \varphi$.

In the context of SM, the structures considered in the implication problems 
are not necessarily deterministic. For SM, the following was shown in [11, 13]. 

Theorem 3.2 [11, 13]: In the context of SM, the implication problem for Pc 
is r.e. complete, and the finite implication problem for Pc is co-r.e. complete. ■ 

We shall show that this undecidability no longer holds in the context of DM. 



3.3 Path constraint language $P_c^w$

Let us generalize path expressions by including the union operator + as follows:

$$w ::= \epsilon \mid K \mid w \cdot w \mid w + w$$

That is, we define path expressions to be regular expressions which do not contain
the Kleene closure. Let us refer to such expressions as union regular expressions.

Let p be a union regular expression and $\rho$ be a path. We use $\rho \in p$ to denote
that $\rho$ is in the regular language defined by p.

We also treat a union regular expression p as a logic formula $p(x, y)$, where x
and y are free variables. We say that a deterministic structure G satisfies $p(x, y)$,
denoted by $G \models p(x, y)$, if there are $\rho \in p$ and $a, b \in |G|$ such that $G \models \rho(a, b)$.
The following should be noted about union regular expressions.

— The regular language defined by a union regular expression is finite.

— Recall that the wildcard symbol “_” matches any edge label. When E, the
set of relation symbols in signature $\sigma$, is finite, we can express “_” as a union
regular expression. More specifically, let E be enumerated as $K_1, K_2, \ldots, K_n$.
Then “_” can be defined as the union regular expression $K_1 + K_2 + \ldots + K_n$.

For example, $\mathit{emp} \cdot \_ \cdot \mathit{manager}$ and $\mathit{emp} \cdot \_ \cdot \mathit{supervising} \cdot \_$ can be represented
as union regular expressions.
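
Both observations can be made concrete by enumerating the language of a union regular expression. The Python sketch below uses a small tuple-based syntax tree; the tags `"eps"`, `"lab"`, `"cat"`, `"alt"` and the sample label set are assumptions introduced only for this illustration.

```python
def language(expr):
    """Finite set of paths (tuples of labels) denoted by a union regular expression."""
    kind = expr[0]
    if kind == "eps":                      # the empty path
        return {()}
    if kind == "lab":                      # a single edge label K
        return {(expr[1],)}
    if kind == "cat":                      # concatenation w . w
        return {u + v for u in language(expr[1]) for v in language(expr[2])}
    if kind == "alt":                      # union w + w
        return language(expr[1]) | language(expr[2])
    raise ValueError(f"unknown expression kind: {kind!r}")

def wildcard(labels):
    """The wildcard as K1 + K2 + ... + Kn over a finite, non-empty label set."""
    labels = sorted(labels)
    expr = ("lab", labels[0])
    for k in labels[1:]:
        expr = ("alt", expr, ("lab", k))
    return expr

# Hypothetical finite label set E.
E = {"emp", "e1", "e2", "manager"}
# emp . _ . manager
expr = ("cat", ("lab", "emp"), ("cat", wildcard(E), ("lab", "manager")))
paths = language(expr)
assert len(paths) == len(E)
assert ("emp", "e1", "manager") in paths
```

The language stays finite because no construct can repeat a subexpression; its size can, however, grow exponentially in the size of the expression when unions are nested under concatenation.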

We define $P_c^w$ using the same syntax as $P_c$. But in $P_c^w$ we use union regular
expressions instead of simple paths to represent path expressions.

For example, path constraints $\phi_1$, $\phi_2$ and $\phi_3$ given in Sect. 2 are $P_c^w$
constraints, but they are not in $P_c$.

A deterministic structure G satisfies a constraint $\phi$ of $P_c^w$, denoted by $G \models \phi$,
if the following condition is satisfied:

— when $\phi$ is a forward constraint $\forall x\,(p(r, x) \rightarrow \forall y\,(q(x, y) \rightarrow s(x, y)))$: for
all vertices $a, b \in |G|$, if there exist paths $\alpha \in p$ and $\beta \in q$ such that
$G \models \alpha(r^G, a) \wedge \beta(a, b)$, then there exists a path $\gamma \in s$ such that $G \models \gamma(a, b)$;

— when $\phi$ is a backward constraint $\forall x\,(p(r, x) \rightarrow \forall y\,(q(x, y) \rightarrow s(y, x)))$:
for all vertices $a, b \in |G|$, if there exist paths $\alpha \in p$ and $\beta \in q$ such that
$G \models \alpha(r^G, a) \wedge \beta(a, b)$, then there exists $\gamma \in s$ such that $G \models \gamma(b, a)$.
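
Since each union regular expression denotes a finite set of paths, this satisfaction condition can be tested by enumeration. The Python sketch below takes the three path sets directly as sets of label tuples; the encoding, the function names and the sample database are assumptions made for illustration.

```python
def follow(graph, node, path):
    """Unique node reached from `node` along `path`, or None if none exists."""
    for label in path:
        node = graph.get(node, {}).get(label)
        if node is None:
            return None
    return node

def satisfies(graph, root, p, q, s, backward=False):
    """Check a wildcard-style constraint whose path sets are p, q and s."""
    for alpha in p:
        a = follow(graph, root, alpha)
        if a is None:
            continue
        for beta in q:
            b = follow(graph, a, beta)
            if b is None:
                continue
            if backward:
                ok = any(follow(graph, b, gamma) == a for gamma in s)
            else:
                ok = any(follow(graph, a, gamma) == b for gamma in s)
            if not ok:
                return False
    return True

# Hypothetical database: Smith manages Jones, Jones manages Lee.
G = {
    "r":   {"emp": "Emp"},
    "Emp": {"e1": "n1", "e2": "n2", "e3": "n3"},
    "n1":  {"name": "Smith", "supervising": "S1"},
    "n2":  {"name": "Jones", "manager": "n1", "supervising": "S2"},
    "n3":  {"name": "Lee", "manager": "n2"},
    "S1":  {"e2": "n2", "e3": "n3"},
    "S2":  {"e3": "n3"},
}
labels = {k for edges in G.values() for k in edges}

# "Every employee's manager supervises that employee", with the wildcard
# expanded over the finite label set.
p = {("emp", k) for k in labels}
q = {("manager",)}
s = {("supervising", k) for k in labels}
assert satisfies(G, "r", p, q, s, backward=True)

# Dropping Lee from Jones's supervision list breaks the constraint.
G_bad = dict(G, S2={})
assert not satisfies(G_bad, "r", p, q, s, backward=True)
```

The outer loops range over paths rather than over nodes: by determinism each path in p or q picks out at most one node, so enumerating paths covers every pair (a, b) that the definition quantifies over.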

The implication and finite implication problems for $P_c^w$ are formalized in the
same way as for $P_c$, as described in Sect. 3.2. Obviously, $P_c$ is properly contained
in $P_c^w$. Thus the corollary below follows immediately from Theorem 3.2.

Corollary 3.3: In the context of SM, the implication and finite implication
problems for $P_c^w$ are undecidable. ■

We shall show that this undecidability result also breaks down in DM.



3.4 Path constraint language $P_c^*$

We next further generalize the syntax of path expressions by including the Kleene
closure * as follows:

$$e ::= \epsilon \mid K \mid e \cdot e \mid e + e \mid e^*$$

That is, we define path expressions to be general regular expressions. Recall
that the wildcard symbol “_” can be expressed as a (union) regular expression if
E is finite. In Sect. 2, we have seen the following path expressions that can be
expressed as regular expressions: $\mathit{manager} \cdot \mathit{manager}^*$, $\mathit{emp} \cdot \_ \cdot \mathit{manager}^*$, etc.

Let p be a regular expression and $\rho$ be a path. As in Sect. 3.3, we use $\rho \in p$
to denote that $\rho$ is in the regular language defined by p. We also treat p as a
logic formula $p(x, y)$, and define $G \models p(x, y)$ for deterministic structure G.

The syntax of $P_c^*$ constraints is the same as that of $P_c^w$. But path expressions
are represented by general regular expressions in $P_c^*$ constraints, rather than by
union regular expressions.

For example, $\psi_1$, $\psi_2$ and $\psi_3$ given in Sect. 2 are $P_c^*$ constraints, but they are
in neither $P_c$ nor $P_c^w$.

As in Sect. 3.3, for a deterministic structure G and a $P_c^*$ constraint $\psi$, we
can define the notion of $G \models \psi$. Similarly, we can formalize the implication and
finite implication problems for $P_c^*$.

For example, recall constraints $\psi_1$, $\psi_2$, $\phi_1$, $\phi_2$ and $\psi_3$ given in Sect. 2 and
let $\Sigma$ be $\{\psi_1, \psi_2, \phi_1, \phi_2\}$. Then the question whether $\Sigma \models \psi_3$ ($\Sigma \models_f \psi_3$) is an
instance of the (finite) implication problem for $P_c^*$. In Sect. 2, this implication
is used in the proof of the equivalence of the queries Q2 and Q2'.

Clearly, $P_c^w$ is a proper subset of $P_c^*$. Therefore, by Corollary 3.3, we have:

Corollary 3.4: In the context of SM, the implication and finite implication
problems for $P_c^*$ are undecidable. ■

We shall show that this undecidability still holds in the context of DM.

4 Path constraint implication 

In this section, we study the implication problems associated with $P_c$, $P_c^w$ and $P_c^*$
for the deterministic data model DM. More specifically, we show the following.

Theorem 4.1: In the context of DM, the implication and finite implication
problems for $P_c$ are finitely axiomatizable and are decidable in cubic time. ■

Proposition 4.2: In the context of DM, the implication and finite implication
problems for $P_c^w$ are decidable. ■

Theorem 4.3: In the context of DM, the implication and finite implication
problems for $P_c^*$ are undecidable. ■

In contrast to Theorem 3.2 and Corollary 3.3, Theorem 4.1 and
Proposition 4.2 show that for DM, the implication problems for $P_c$ and $P_c^w$ are
decidable. These demonstrate that the determinism condition of DM may simplify
reasoning about path constraints. However, Theorem 4.3 shows that this
determinism condition does not trivialize the problem of path constraint implication.



4.1 Decidability of Pc 



We prove Theorem 4.1 in two steps. We first present a finite axiomatization for 
Pc in the context of DM. That is, a finite set of inference rules that is sound 
and complete for implication and finite implication of Pc constraints. We then 
show that there is a cubic-time algorithm for testing Pc constraint implication. 
A finite axiomatization. Before we present a finite axiomatization for Pc, 
we first study basic properties of Pc constraints in the context of DM. While 
Lemma 4.6 given below holds in both SM and DM, Lemmas 4.4 and 4.5 hold 
in the context of DM but not in SM. Their proofs require Lemma 3.1. We omit 
the proofs due to the lack of space, but we encourage the reader to consult [14]. 

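Lemma 4.4: For any $\varphi = \forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(x, y)))$, i.e., forward
constraint of $P_c$, there is a word constraint $\psi = \forall x\,(\alpha \cdot \beta(r, x) \rightarrow \alpha \cdot \gamma(r, x))$
such that for any deterministic structure G, $G \models \varphi$ iff $G \models \psi$. ■

Lemma 4.5: For any $\varphi = \forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(y, x)))$, i.e., backward
constraint of $P_c$, there is a word constraint $\psi = \forall x\,(\alpha(r, x) \rightarrow \alpha \cdot \beta \cdot \gamma(r, x))$
such that for any deterministic structure G, if $G \models \psi$ then $G \models \varphi$. In addition,
if $G \models \exists x\,(\alpha \cdot \beta(r, x)) \wedge \varphi$, then $G \models \psi$. ■

Lemma 4.6: For any finite subset $\Sigma \cup \{\varphi\}$ of $P_c$,

$$\Sigma \models \varphi \quad\text{iff}\quad \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \models \varphi,$$

$$\Sigma \models_f \varphi \quad\text{iff}\quad \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \models_f \varphi,$$

where $pf(\varphi)$ and $lt(\varphi)$ are described in Sect. 3.2. ■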

Based on Lemma 4.6, we extend $P_c$ by including constraints of the existential
form as follows:

$$P_c^e = P_c \cup \{\exists x\,\rho(r, x) \mid \rho \text{ is a path}\}.$$

Constraints of the existential form enable us to assert the existence of paths. As
pointed out by [23], they are important for specifying Web link characteristics.

For $P_c^e$, we consider a set of inference rules, $\mathcal{I}_c$, given below. Note that the
last four inference rules in $\mathcal{I}_c$ are sound in DM because of Lemmas 4.4 and 4.5.



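— Reflexivity:

$$\forall x\,(\alpha(r, x) \rightarrow \alpha(r, x))$$

— Transitivity:

$$\frac{\forall x\,(\alpha(r, x) \rightarrow \beta(r, x)) \qquad \forall x\,(\beta(r, x) \rightarrow \gamma(r, x))}{\forall x\,(\alpha(r, x) \rightarrow \gamma(r, x))}$$

— Right-congruence:

$$\frac{\forall x\,(\alpha(r, x) \rightarrow \beta(r, x))}{\forall x\,(\alpha \cdot \gamma(r, x) \rightarrow \beta \cdot \gamma(r, x))}$$

— Empty-path:

$$\exists x\,\epsilon(r, x)$$

— Prefix:

$$\frac{\exists x\,(\alpha \cdot \beta(r, x))}{\exists x\,\alpha(r, x)}$$

— Entail:

$$\frac{\exists x\,\alpha(r, x) \qquad \forall x\,(\alpha(r, x) \rightarrow \beta(r, x))}{\exists x\,\beta(r, x)}$$

— Symmetry:

$$\frac{\exists x\,\alpha(r, x) \qquad \forall x\,(\alpha(r, x) \rightarrow \beta(r, x))}{\forall x\,(\beta(r, x) \rightarrow \alpha(r, x))}$$

— Forward-to-word:

$$\frac{\forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(x, y)))}{\forall x\,(\alpha \cdot \beta(r, x) \rightarrow \alpha \cdot \gamma(r, x))}$$

— Word-to-forward:

$$\frac{\forall x\,(\alpha \cdot \beta(r, x) \rightarrow \alpha \cdot \gamma(r, x))}{\forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(x, y)))}$$

— Backward-to-word:

$$\frac{\exists x\,(\alpha \cdot \beta(r, x)) \qquad \forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(y, x)))}{\forall x\,(\alpha(r, x) \rightarrow \alpha \cdot \beta \cdot \gamma(r, x))}$$

— Word-to-backward:

$$\frac{\forall x\,(\alpha(r, x) \rightarrow \alpha \cdot \beta \cdot \gamma(r, x))}{\forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(y, x)))}$$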

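Let $\Sigma \cup \{\varphi\}$ be a finite subset of $P_c^e$. We use $\Sigma \vdash_{\mathcal{I}_c} \varphi$ to denote that $\varphi$ is provable
from $\Sigma$ using $\mathcal{I}_c$. That is, there is an $\mathcal{I}_c$-proof of $\varphi$ from $\Sigma$.

The following theorem shows that $\mathcal{I}_c$ is indeed a finite axiomatization of $P_c$.

Theorem 4.7: In the context of DM, for each finite subset $\Sigma \cup \{\varphi\}$ of $P_c$,

$$\Sigma \models \varphi \quad\text{iff}\quad \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi,$$

$$\Sigma \models_f \varphi \quad\text{iff}\quad \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi. \quad ■$$

Proof sketch: It suffices to show that $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \models \varphi$ if and
only if $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi$, because of Lemma 4.6.

Soundness of $\mathcal{I}_c$ can be verified by induction on the lengths of $\mathcal{I}_c$-proofs. For
the proof of completeness, it suffices to show the following claim: There is a
finite deterministic structure G such that $G \models \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\}$.
In addition, if $G \models \varphi$, then $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi$.

To see this, suppose that $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \models \varphi$. By the claim,
$G \models \Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\}$. Thus we have $G \models \varphi$. In addition, since G
is finite, if $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \models_f \varphi$, then $G \models \varphi$. Thus again by the
claim, $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi$. Space limitations do not allow us to
include the lengthy definition of G. The interested reader should consult [14]. ■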
As an immediate corollary of Theorem 4.7, in the context of DM, the
implication and finite implication problems for $P_c$ coincide and are decidable.

In addition, it can be shown that $\mathcal{I}_c$ is also a finite axiomatization of $P_c^e$, by
using a proof similar to that of Theorem 4.7.

Theorem 4.8: In the context of DM, for any finite subset $\Sigma \cup \{\varphi\}$ of $P_c^e$, if $\varphi$ is
in $P_c$, then $\Sigma \models \varphi$ iff $\Sigma \cup \{\exists x\,(pf(\varphi) \cdot lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_c} \varphi$ iff $\Sigma \models_f \varphi$. Otherwise,
i.e., when $\varphi$ is an existential constraint, $\Sigma \models \varphi$ iff $\Sigma \vdash_{\mathcal{I}_c} \varphi$ iff $\Sigma \models_f \varphi$. ■

For SM, [4] has shown that the first three rules of $\mathcal{I}_c$, i.e., Reflexivity,
Transitivity and Right-congruence, are sound and complete for word constraints. In the
context of DM, however, these rules are no longer complete. To illustrate this,
let $\alpha$ be a path and consider the word constraints $\varphi = \forall x\,(\epsilon(r, x) \rightarrow \alpha(r, x))$
and $\phi = \forall x\,(\alpha(r, x) \rightarrow \epsilon(r, x))$. By Lemma 3.1, it can be verified that $\varphi \models \phi$.
However, this implication cannot be derived by using these three rules.
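
By contrast, the same implication has a two-step derivation once the Empty-path and Symmetry rules are available. The following is a short worked sketch, with $\alpha$ an arbitrary path; the Symmetry rule is instantiated with the empty path $\epsilon$ in its existential premise.

$$
\begin{array}{lll}
1. & \exists x\,\epsilon(r, x) & \text{Empty-path} \\
2. & \forall x\,(\epsilon(r, x) \rightarrow \alpha(r, x)) & \text{hypothesis} \\
3. & \forall x\,(\alpha(r, x) \rightarrow \epsilon(r, x)) & \text{Symmetry applied to 1 and 2}
\end{array}
$$

Semantically, line 2 says that following $\alpha$ from the root leads back to the root; by determinism the root is then the only node reachable by $\alpha$, which is what line 3 states.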

In the context of DM, the first seven rules of $\mathcal{I}_c$ are sound and complete for
word constraint implication. Let $\mathcal{I}_w$ be the set consisting of these seven rules.
With slight modification, the proof of Theorem 4.7 can show the following.

Theorem 4.9: In the context of DM, for each finite set $\Sigma \cup \{\varphi\}$ of word
constraints, $\Sigma \models \varphi$ iff $\Sigma \cup \{\exists x\,(lt(\varphi)(r, x))\} \vdash_{\mathcal{I}_w} \varphi$ iff $\Sigma \models_f \varphi$. ■

A cubic-time algorithm. Based on Theorem 4.7, we can show the following:

Proposition 4.10: There is an algorithm that, given a finite subset $\Sigma$ of $P_c$ and
paths $\alpha$, $\beta$, computes a finite deterministic structure G in time $O(n^3)$, where n
is the length of $\Sigma$ and $\alpha \cdot \beta$. The structure G has the following property: there
are $a, b \in |G|$ such that $G \models \alpha(r^G, a) \wedge \beta(a, b)$, and moreover, for any path $\gamma$,

$$G \models \gamma(a, b) \quad\text{iff}\quad \Sigma \cup \{\exists x\,(\alpha \cdot \beta(r, x))\} \vdash_{\mathcal{I}_c} \forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(x, y))),$$

$$G \models \gamma(b, a) \quad\text{iff}\quad \Sigma \cup \{\exists x\,(\alpha \cdot \beta(r, x))\} \vdash_{\mathcal{I}_c} \forall x\,(\alpha(r, x) \rightarrow \forall y\,(\beta(x, y) \rightarrow \gamma(y, x))).$$

■

The algorithm takes advantage of Lemma 3.1 and has low complexity. It
constructs the structure described in the proposition. Each step of the construction
corresponds to an application of some inference rule in $\mathcal{I}_c$. By Theorem 4.7, it
can be used for testing $P_c$ constraint implication in the context of DM. We do
not include the proof of the proposition due to the lack of space. The interested
reader should see [14] for a detailed proof and the algorithm.

4.2 Decidability of $P_c^w$

We next prove Proposition 4.2. To establish the decidability of the implication
and finite implication problems for $P_c^w$, it suffices to give a small model argument:

Claim: Given any finite subset $\Sigma \cup \{\varphi\}$ of $P_c^w$, we can effectively
compute a bound $k$ such that if $\Sigma \cup \{\neg\varphi\}$ has a deterministic model
then it has a finite deterministic model of size at most $k$.

For if the claim holds, then the implication and finite implication problems
for $P_c^w$ coincide and are decidable.

To show the claim, let $\Phi = \bigwedge \Sigma \wedge \neg\varphi$ and assume that there is a
deterministic structure $G$ satisfying $\Phi$. Let

$PEs(\Phi) = \{pf(\psi) \cdot lt(\psi),\ pf(\psi) \cdot rt(\psi) \mid \psi \in \Sigma \cup \{\varphi\},\ \psi \text{ has the forward form}\}$
$\qquad\quad \cup\ \{pf(\psi) \cdot lt(\psi) \cdot rt(\psi) \mid \psi \in \Sigma \cup \{\varphi\},\ \psi \text{ has the backward form}\}$,
$Pts(\Phi) = \{\varrho \mid \varrho \text{ is a path},\ p \in PEs(\Phi),\ \varrho \in p\}$,
$CloPts(\Phi) = \{\rho \mid \varrho \in Pts(\Phi),\ \rho \preceq \varrho\}$.

Here $pf(\psi)$, $lt(\psi)$ and $rt(\psi)$ are union regular expressions, $\varrho \in p$
means that path $\varrho$ is in the regular language defined by regular expression $p$,
and $\rho \preceq \varrho$ denotes that path $\rho$ is a prefix of path $\varrho$. Let
$E_\Phi$ be the set of edge labels appearing in some path in $Pts(\Phi)$. Then we define
structure $H$ to be $(|H|, r^H, E^H)$ such that
$|H| = \{a \mid a \in |G|,\ \rho \in CloPts(\Phi),\ G \models \rho(r^G, a)\}$, $r^H = r^G$,
and for all $a, b \in |H|$ and $K \in E$, $H \models K(a, b)$ iff $K \in E_\Phi$ and
$G \models K(a, b)$. It is easy to verify that $H \models \Phi$ and $H$ is deterministic,
since $G$ has these properties. By Lemma 3.1, the size of $|H|$ is at most the
cardinality of $CloPts(\Phi)$, which is finite
because the regular language defined by a union regular expression is finite.
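
The finiteness bound can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the construction of the paper: a union regular expression is modelled here as a sequence of label sets (a wildcard position being the set of all labels), a path is a tuple of labels, and the function names are hypothetical.

```python
from itertools import product

def language(expr):
    """Finite language of a union regular expression given as a sequence of label sets."""
    return set(product(*expr))

def prefix_closure(paths):
    """All prefixes (including the empty path) of the given paths."""
    closure = set()
    for path in paths:
        for i in range(len(path) + 1):
            closure.add(path[:i])
    return closure

# Hypothetical expression a.(b + c): two paths, four prefixes in total.
example_paths = language([{"a"}, {"b", "c"}])
example_bound = len(prefix_closure(example_paths))
```

In a deterministic structure each path from the root reaches at most one node, so the size of the prefix closure bounds the number of nodes that the restricted structure can contain.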

It should be noted that $E_\Phi$ and $CloPts(\Phi)$ are determined by $\Phi$ only. Thus
Proposition 4.2 holds even if $E$, the set of relation symbols in signature $\sigma$, is
infinite. However, when wildcards are considered, we require $E$ to be finite so
that the wildcard can be expressed as a union regular expression. Note that the
proof above is uniform in the size of $E$. More specifically, the proof gives an
effective way of going from the number of labels $n$ to a decision procedure $\mathcal{D}_n$.

4.3 Undecidability of $P_c^*$

Next, we show Theorem 4.3. We establish the undecidability of the (finite)
implication problem for $P_c^*$ by reduction from the word problem for (finite) monoids.
Before we give the proof, we first review the word problem for (finite) monoids.

The word problem for (finite) monoids. Let $\Gamma$ be a finite alphabet and
$(\Gamma^*, \cdot, \epsilon)$ be the free monoid generated by $\Gamma$. An equation over
$\Gamma$ is a pair $(\alpha, \beta)$ of strings in $\Gamma^*$.

Let $\Theta = \{(\alpha_i, \beta_i) \mid \alpha_i, \beta_i \in \Gamma^*,\ i \in [1, n]\}$ and
a test equation $\theta$ be $(\alpha, \beta)$. We use $\Theta \models \theta$
($\Theta \models_f \theta$) to denote that for any (finite) monoid $(M, \circ, id)$ and any
homomorphism $h : \Gamma^* \rightarrow M$, if $h(\alpha_i) = h(\beta_i)$ for $i \in [1, n]$,
then $h(\alpha) = h(\beta)$.

The word problem for (finite) monoids is the problem of determining, given
any $\Theta$ and $\theta$, whether $\Theta \models \theta$ ($\Theta \models_f \theta$).
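
To make the definition concrete, the sketch below checks whether one particular homomorphism into a finite monoid refutes a test equation. It is an illustrative example only: the monoid (integers modulo 2 under addition), the alphabet, the equations and all function names are assumptions chosen for the illustration. A single refuting homomorphism shows that the implication fails; the absence of one for a given monoid proves nothing, which is in line with the undecidability of the general problem.

```python
def extend(letter_images, op, identity):
    """Extend a map on letters to a monoid homomorphism on strings."""
    def h(word):
        result = identity
        for letter in word:
            result = op(result, letter_images[letter])
        return result
    return h

def refutes(equations, test, h):
    """True if h satisfies every equation but separates the two sides of `test`."""
    satisfied = all(h(a) == h(b) for a, b in equations)
    return satisfied and h(test[0]) != h(test[1])

# Hypothetical instance: the monoid Z_2 under addition, with h(a) = 1 and h(b) = 0.
h = extend({"a": 1, "b": 0}, lambda x, y: (x + y) % 2, 0)
equations = [("aa", ""), ("b", "")]
```

Here the empty string stands for the empty word, and `refutes(equations, ("a", ""), h)` being true witnesses that the equations do not imply `a` equal to the empty word over finite monoids.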

The following result is well-known (see, e.g., [2]). 

Theorem 4.11: Both the word problem for monoids and the word problem for 
finite monoids are undecidable. ■ 

Reduction from the word problem. We next present an encoding of the
word problem for (finite) monoids. Let $\Gamma_0$ be a finite alphabet and $\Theta_0$ be a
finite set of equations over $\Gamma_0$. Without loss of generality, assume
$\Gamma_0 \subseteq E$, where $E$ is the set of binary relation symbols in signature
$\sigma$ described in Sect. 3.1. Assume

$\Gamma_0 = \{K_j \mid j \in [1, m],\ K_i \neq K_j \text{ if } i \neq j\}$,
$\Theta_0 = \{(\alpha_i, \beta_i) \mid \alpha_i, \beta_i \in \Gamma_0^*,\ i \in [1, n]\}$.

Note here that each symbol in $\Gamma_0$ is a binary relation symbol in $E$. Therefore,
each string $\alpha$ in $\Gamma_0^*$ can be represented as a path, also denoted by $\alpha$.

Let $e_0$ be the regular expression defined by $e_0 = (K_1 + K_2 + \cdots + K_m)^*$. We
encode $\Theta_0$ with a subset $\Sigma$ of $P_c^*$, which includes the following: for
$i \in [1, n]$,

$\forall x\,(e_0(r, x) \rightarrow \forall y\,(\alpha_i(x, y) \rightarrow \beta_i(x, y)))$,
$\forall x\,(e_0(r, x) \rightarrow \forall y\,(\beta_i(x, y) \rightarrow \alpha_i(x, y)))$.

We encode test equation $(\alpha, \beta)$, where $\alpha$ and $\beta$ are arbitrary strings
in $\Gamma_0^*$, with

$\varphi = \forall x\,(e_0(r, x) \rightarrow \forall y\,(\alpha(x, y) \rightarrow \beta(x, y)))$.

The lemma below shows that the encoding given above is indeed a reduction 
from the word problem for (finite) monoids. From this lemma and Theorem 4.11, 
Theorem 4.3 follows immediately. 

Lemma 4.12: In the context of DM,

$\Theta_0 \models (\alpha, \beta)$ iff $\Sigma \models \varphi$, (a)
$\Theta_0 \models_f (\alpha, \beta)$ iff $\Sigma \models_f \varphi$. (b)
■

Proof sketch: We give a proof sketch of (b). The proof of (a) is similar and
simpler. Owing to the space limit, we omit the details of the lengthy proof, but
we encourage the interested reader to consult [14].

(if) Suppose $\Theta_0 \not\models_f (\alpha, \beta)$. Then there is a finite monoid $M$ and
a homomorphism $h : \Gamma_0^* \rightarrow M$ such that $h(\alpha_i) = h(\beta_i)$ for
$i \in [1, n]$, but $h(\alpha) \neq h(\beta)$. We show that there is a finite
deterministic structure $G$ such that $G \models \Sigma$ and $G \not\models \varphi$. To do
this, we define an equivalence relation $\approx$ on $\Gamma_0^*$:

$\rho \approx \varrho$ iff $h(\rho) = h(\varrho)$.

For each string $\rho \in \Gamma_0^*$, let $\hat{\rho}$ be the equivalence class of $\rho$
with respect to $\approx$ and let $o(\hat{\rho})$ be a distinct node. Then we define a
structure $G = (|G|, r^G, E^G)$ such that $|G| = \{o(\hat{\rho}) \mid \rho \in \Gamma_0^*\}$
and the root $r^G = o(\hat{\epsilon})$. The binary relations are populated in $G$ such
that for any $K \in E$ and $o(\hat{\rho}), o(\hat{\varrho}) \in |G|$,
$G \models K(o(\hat{\rho}), o(\hat{\varrho}))$ iff $\rho \cdot K \in \hat{\varrho}$. It can
be verified that $G$ is indeed a finite deterministic structure. In addition,
$G \models \Sigma$ and $G \not\models \varphi$. A property of $e_0$ used in the proof is
$\epsilon \in e_0$. That is, the empty path is in the language defined by the regular
expression $e_0$.

(only if) Suppose that there is a finite deterministic structure $G$ such that
$G \models \Sigma$ and $G \not\models \varphi$. Then we define a finite monoid
$(M, \circ, id)$ and a homomorphism $h : \Gamma_0^* \rightarrow M$ such that for each
$i \in [1, n]$, $h(\alpha_i) = h(\beta_i)$, but $h(\alpha) \neq h(\beta)$. To do
this, we define another equivalence relation $\sim$ on $\Gamma_0^*$ as follows:

$\rho \sim \varrho$ iff $G \models \forall x\,(e_0(r, x) \rightarrow \forall y\,(\rho(x, y) \rightarrow \varrho(x, y))) \wedge \forall x\,(e_0(r, x) \rightarrow \forall y\,(\varrho(x, y) \rightarrow \rho(x, y)))$.

For each $\rho \in \Gamma_0^*$, let $[\rho]$ denote the equivalence class of $\rho$ with
respect to $\sim$. We define $M = \{[\rho] \mid \rho \in \Gamma_0^*\}$, operator $\circ$ by
$[\rho] \circ [\varrho] = [\rho \cdot \varrho]$, identity $id = [\epsilon]$, and
$h : \Gamma_0^* \rightarrow M$ by $h : \rho \mapsto [\rho]$. It can be verified that
$(M, \circ, [\epsilon])$ is a finite monoid, $h$ is a homomorphism, and moreover,
$h(\alpha_i) = h(\beta_i)$ for $i \in [1, n]$, but $h(\alpha) \neq h(\beta)$. The proof
uses a property of $e_0$: for any $\rho \in \Gamma_0^*$, $e_0 \cdot \rho \subseteq e_0$.
That is, the language defined by the regular expression $e_0 \cdot \rho$ is contained
in the language defined by $e_0$. ■

5 Conclusion

We have investigated path constraints for the deterministic data model DM.
Three path constraint languages have been considered: $P_c$, $P_c^w$ and $P_c^*$. We
have demonstrated how constraints of these languages might be used for, among
others, query optimization. We have also studied implication problems for these
constraint languages in the context of DM. More specifically, we have shown
that in contrast to the undecidability result of [11, 13], the implication and finite
implication problems for $P_c$ and $P_c^w$ are decidable in the context of DM. In
particular, the implication problems for $P_c$ are decidable in cubic-time and are
finitely axiomatizable. These results show that the determinism condition of
DM may simplify the analysis of path constraint implication. However, we have
also shown that the implication and finite implication problems for $P_c^*$ remain
undecidable in the context of DM. This shows that the determinism condition
does not trivialize the problem of path constraint implication.

A number of questions are open. First, a more general deterministic data 
model, DDM, was proposed in [10], in which edge labels may also have structure.
A type system for DDM is currently under development, in which certain path 
constraints are embedded. A natural question here is: do the complexity results 
established here hold in DDM? This question becomes more intriguing when 
types are considered. As shown in [12], adding a type to the data in some cases 
simplifies reasoning about path constraints, and in other cases makes it harder. 
Second, to define a richer data model for semistructured data, one may want 
to label edges with logic formulas. In this setting, do the decidability results 
of this paper still hold? Third, can path constraints help in reasoning about 
the equivalence of data representations? Finally, how should path constraints be 
used in reasoning about the containment and equivalence of path queries? 

Acknowledgements. We thank Victor Vianu for valuable suggestions.

Abstract. We consider two-dimensional spatial databases defined in
terms of polynomial inequalities and focus on the potential of program- 
ming languages for such databases to express queries related to topo- 
logical connectivity. It is known that the topological connectivity test is 
not first-order expressible. One approach to obtain a language in which 
connectivity queries can be expressed would be to extend FO+Poly
with a generalized (or Lindstrom) quantifier expressing that two points 
belong to the same connected component of a given database. For the 
expression of topological connectivity, extensions of first-order languages 
with recursion have been studied (in analogy with the classical relational 
model). Two such languages are spatial Datalog and FO+Poly+While. 
Although both languages allow the expression of non-terminating pro- 
grams, their (proven for FO+Poly+While and conjectured for spatial 
Datalog) computational completeness makes them interesting objects of 
study. 

Previously, spatial Datalog programs have been studied for more restrictive
forms of connectivity (e.g., piece-wise linear connectivity) and these
programs were proved to correctly test connectivity on restricted classes 
of spatial databases (e.g., linear databases) only. 

In this paper, we present a spatial Datalog program that correctly tests 
topological connectivity of arbitrary compact (i.e., closed and bounded) 
spatial databases. In particular, it is guaranteed to terminate on this 
class of databases. This program is based on a first-order description of 
a known topological property of spatial databases, namely that locally 
they are conical. 

We also give a very natural implementation of topological connectivity 
in FO+Poly+While, that is based on a first-order implementation of 
the curve selection lemma, and that works correctly on arbitrary spatial
database inputs. Finally, we raise the question whether topological
connectivity of arbitrary spatial databases can also be expressed in spatial
Datalog.

1 Introduction

The framework of constraint databases, introduced by Kanellakis, Kuper and 
Revesz [10] (an overview of the area of constraint databases can be found in [14]), 
provides a rather general model for spatial databases [16]. In this context, a 
spatial database, which conceptually can be viewed as an infinite set of points 
in the real space, is finitely represented as a union of systems of polynomial 
equations and inequalities (in mathematical terminology, such figures are called 
semi-algebraic sets [3]). The set of points in the real plane that are situated
between two touching circles together with a segment of a parabola, depicted in 
Figure 1, is an example of such a spatial database and it could be represented 
by the polynomial constraint formula 

$(x^2 + (y - 1)^2 \leq 1 \wedge 25x^2 + (5y - 4)^2 > 16) \vee (y^2 - x = 0 \wedge (0 \leq y \wedge x \leq 1))$.

In this paper, we will restrict our attention to two-dimensional spatial databases, 
a class of figures that supports important spatial database applications such as 
geographic information systems (GIS). 
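
As an illustration of how such a finite representation stands for an infinite point set, the sketch below tests membership of a point in the example database. It assumes the constraint formula is read as the region inside the circle with centre (0, 1) and radius 1 and strictly outside the circle with centre (0, 4/5) and radius 4/5, together with the parabola segment $y^2 = x$ with $0 \leq y$ and $x \leq 1$; the function name is hypothetical and exact rational arithmetic is used to avoid rounding at the boundaries.

```python
from fractions import Fraction

def in_database(x, y):
    """Membership test for the example database, under the reading stated above."""
    x, y = Fraction(x), Fraction(y)
    between_circles = (x**2 + (y - 1)**2 <= 1) and (25 * x**2 + (5 * y - 4)**2 > 16)
    on_parabola = (y**2 - x == 0) and (0 <= y) and (x <= 1)
    return between_circles or on_parabola
```

Under this reading the origin belongs to the database only through the parabola segment, since it lies on both circles and the inequality for the smaller circle is strict.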




Fig. 1. An example of a spatial database.



In the past ten years, several languages to query spatial databases have 
been proposed and studied. A very natural query language, commonly known 
as FO+Poly, is obtained by extending the relational calculus with polynomial 
inequalities [16]. The query that returns the topological interior of a database S 
is expressed by the FO+Poly formula

$(\exists \varepsilon > 0)(\forall x')(\forall y')((x - x')^2 + (y - y')^2 < \varepsilon^2 \rightarrow S(x', y'))$,

with free variables $x$ and $y$ that represent the co-ordinates of the points in the
result of the query. Although variables in such expressions range over the real
numbers, FO+Poly queries can still be effectively computed [5, 18].

A combination of results by Benedikt, Dong, Libkin and Wong [2] and results
of Grumbach and Su [6] implies that one cannot express in FO+Poly that a
database is topologically connected. The topological connectivity test and the
computation of connected components of databases are, however, decidable
queries [8, 17] and are of great importance in many spatial database applications.

One approach to obtain a language in which connectivity queries can be
expressed would be to extend FO+Poly with a generalized (or Lindström)
quantifier expressing that two points belong to the same connected component
of a given database. In analogy with the classical graph connectivity query,
which cannot be expressed in the standard relational calculus but which can
be expressed in languages that typically contain a recursion mechanism (such
as Datalog), we instead study extensions of FO+Poly with recursion for
expressing topological connectivity. Two such languages are spatial Datalog and
FO+Poly+While.

Both languages suffer from the well-known defect that their recursion, that 
involves arithmetic over an unbounded domain (namely polynomial inequalities 
over the real numbers), is no longer guaranteed to terminate. Therefore, these 
languages are not closed. FO+Poly+While is known to be a computationally 
complete language for spatial databases [7], however. Spatial Datalog is believed 
to be complete too [11, 13]. It is therefore interesting to establish the termination 
of particular programs in these languages (even be it by ad hoc arguments) as 
it is interesting to do this for programs in computationally complete general- 
purpose programming languages. 

Spatial Datalog [10, 11, 13] essentially is Datalog augmented with polynomial 
inequalities in the bodies of rules. Programs written in spatial Datalog are not 
guaranteed to terminate. It is known that useful restrictions on the databases 
under consideration or on the syntax of allowed spatial Datalog programs are 
unlikely to exist [11]. As a consequence, termination of particular spatial recur- 
sive queries has to be established by ad-hoc arguments. On the other hand, if a 
spatial Datalog program terminates, a finite representation of its output can be 
effectively computed. 

A first attempt [11] to express the topological connectivity test in this lan- 
guage consisted in computing a relation Path which contains all pairs of points 
of the spatial database which can be connected by a straight line segment that 
is completely contained in the database and by then computing the transitive 
closure of the relation Path and testing whether the result contains all pairs 
of points of the input database. In fact, this program tests for piece-wise lin- 
ear connectivity, which is a stronger condition than connectivity. The program, 
however, cannot be guaranteed to work correctly on non-linear databases [11]: it 
experiences both problems with termination and with the correctness of testing 
connectivity (as an illustration: the origin of the database of Figure 1 (a) is a 
cusp point and cannot be connected to any interior point of the database by 
means of a finite number of straight line segments). 

In this paper, we follow a different approach that will lead to a correct
implementation of topological connectivity queries in spatial Datalog for compact
database inputs. In our approach we make use of the fact that locally around
each of its points a spatial database is conical [3]. Our implementation first
determines (in FO+Poly) for each point a radius within which the database is
conical. Then all pairs of points within that radius are added to the relation Path
and, finally, we use the recursion of spatial Datalog to compute the transitive
closure of the relation Path.¹

We raise the question whether topological connectivity of arbitrary (not
necessarily compact) spatial databases can be implemented in spatial Datalog. It
can be implemented in FO+Poly+While, the extension of FO+Poly with a
while-loop. FO+Poly+While is a computationally complete language for spatial
databases [7], and therefore the known algorithms to test connectivity can be
implemented in this language (one of the oldest methods uses homotopy groups
computed from a CAD [17]; a more recent and more efficient method uses Morse
functions [9]). Our implementation is a very natural one, however. It
is based on a constructive version of the curve selection lemma of semi-algebraic
sets [3, Theorem 2.5.5]. We show that this curve selection can be performed in
FO+Poly. Also in this implementation the transitive closure of a relation Path
(this time initialized using an iteration) is computed. Once this transitive closure
is computed, a number of connectivity queries, such as “Is the spatial database
connected?”, “Return the connected component of the point p in the database”,
and “Are the points p and q in the same connected component of the database?”,
can be formulated. Grumbach and Su give examples of other interesting queries
that can be reduced to connectivity [6].

We prove that both the spatial Datalog and the FO+Poly+While implementations
are guaranteed to terminate and to give correct results.

This paper is organized as follows. In the next section we define spatial
databases and the mentioned query languages and recall the property that spatial
databases are locally conical. In Section 3, we will describe our spatial Datalog
and FO+Poly+While implementations. In Section 4, we will prove their
correctness and termination.

2 Preliminaries 

In this section, we define spatial databases and three query languages for spatial
databases. Let $\mathbf{R}$ denote the set of the real numbers, and $\mathbf{R}^2$ the real plane.

2.1 Definitions 

Definition 1. A spatial database is a geometrical figure in $\mathbf{R}^2$ that can be
defined as a Boolean combination (union, intersection and complement) of sets
of the form $\{(x, y) \mid p(x, y) > 0\}$, where $p(x, y)$ is a polynomial with integer
coefficients in the real variables $x$ and $y$.

¹ In fact, for our purposes it would suffice to consider the extension of FO+Poly with
an operator for transitive closure, rather than the full recursive power of spatial
Datalog.

Note that $p(x, y) = 0$ is used to abbreviate $\neg(p(x, y) > 0) \wedge \neg(-p(x, y) > 0)$.

In this paper, we will use the relational calculus augmented with polynomial 
inequalities, FO+Poly for short, as a query language. 

Definition 2. A formula in FO+Poly is a first-order logic formula built using
the logical connectives and quantifiers from two kinds of atomic formula: $S(x, y)$
and $p(x_1, \ldots, x_k) > 0$, where $S$ is a binary relation name representing the
spatial database and $p(x_1, \ldots, x_k)$ is a polynomial in the variables
$x_1, \ldots, x_k$ with integer coefficients.

Variables in such formulas are assumed to range over $\mathbf{R}$. A second query language
we will use is FO+Poly+While.

Definition 3. A program in FO+Poly+While is a finite sequence of statements
and while-loops. Each statement has the form
$R := \{(x_1, \ldots, x_k) \mid \varphi(x_1, \ldots, x_k)\}$, where $\varphi$ is an FO+Poly
formula that uses the binary relation name $S$ (of the input database) and
previously introduced relation names. Each while-loop has the form
while $\varphi$ do $P$ od, where $P$ is a program and $\varphi$ is an FO+Poly
formula that uses the binary relation name $S$ and previously introduced relation
names.

The semantics of a program applied to a spatial database is the operational,
step by step execution. Over the real numbers it is true that for every computable
constraint query, such as connectivity, there is an equivalent FO+Poly+While
program.

A restricted class of FO+Poly+While programs consists of programs in
spatial Datalog.

Definition 4. Spatial Datalog is Datalog where, 

1. The underlying domain is R; 

2. The only EDB predicate is S, which is interpreted as the set of points in the 
spatial database (or equivalently, as a binary relation) ; 

3. Relations can be infinite; 

4. Polynomial inequalities are allowed in rule bodies. 

We interpret these programs under the bottom-up semantics. To conclude
this section, we remark that a well-known argument can be used to show that
FO+Poly can be expressed in (recursion-free) spatial Datalog with stratified
negation [1]. In this paper we also admit stratified negation in our Datalog
program.

2.2 Spatial databases are locally conical 

Property 1 ([3], Theorem 9.3.5). For a spatial database $A$ and a point $p$ in the
plane there exists a radius $\varepsilon_p$ such that for each $0 < \varepsilon < \varepsilon_p$
it holds that $B^2(p, \varepsilon) \cap A$ is isotopic to the cone with top $p$ and base
$S^1(p, \varepsilon) \cap A$.²

² With $B^2(p, \varepsilon)$ we denote the closed disk with center $p$ and radius $\varepsilon$
and with $S^1(p, \varepsilon)$ its bordering circle. A homeomorphism
$h : \mathbf{R}^2 \rightarrow \mathbf{R}^2$ is a continuous bijective function

We remark that a spatial database is also conical towards infinity. In the
next section, we will show that such a radius $\varepsilon_p$ can be uniformly defined in
FO+Poly.

The database of Figure 1 is locally around the origin isotopic to the cone that
is shown in Figure 2. It is a basic property of semi-algebraic sets that the base
$S^1(p, \varepsilon) \cap A$ is the finite union of points and open arc segments on
$S^1(p, \varepsilon)$ [3]. We will refer to the parts of $B^2(p, \varepsilon) \cap A$ defined by
these open intervals and points as the sectors of $p$ in $A$. In the example of
Figure 2, we see that the origin has five sectors: two arc segments of the larger
circle, a segment of the parabolic curve and two areas between the two circles.
Sectors are either curves or fully two-dimensional.




Fig. 2. The cone of (0, 0) of the spatial database in Figure 1. 



In the next sections, we will use the following property. It can be proven
in the same way as for closed spatial databases [12].

Property 2. Let A be a spatial database. Then the following holds: 

1. Only a finite number of cone types appear in A; 

2. A can only have infinitely many points of five cone types (interior points, 
points on a smooth border of the interior that (don’t) belong to the database, 
points on a curve, points on a curve of the complement); 

3. The number of cone types appearing in A is finite and hence the number of 
points in A with a cone different from these five is finite. 

The points with a cone of the five types mentioned in (2) of Property 2 are
called regular points of the database. Non-regular points are called singular. We
remark that the regularity of a point is expressible in FO+Poly [12].

whose inverse is also continuous. An isotopy of the plane is an orientation-preserving
homeomorphism. Two sets are said to be isotopic if there is an isotopy that maps
one to the other.

3 Connectivity queries in spatial Datalog

In general, a set $S$ of points in the plane is defined to be topologically connected
if it cannot be partitioned by two disjoint open subsets. This second-order
definition seems to be unsuitable for implementation in spatial Datalog. Fortunately,
for spatial databases $S$, we have the property that $S$ is topologically connected
if and only if $S$ is path connected [3, Section 2.4] (i.e., if and only if any pair
of points of $S$ can be connected by a semi-algebraic curve that is entirely
contained in $S$). In this section, we will show that for compact spatial databases
path connectivity can be implemented in spatial Datalog and that for arbitrary
databases it can be implemented in FO+Poly+While.

3.1 A program in spatial Datalog with stratified negation for 
connectivity of compact spatial databases 

The spatial Datalog program for testing connectivity that we describe in this 
section is given in Figure 3. 



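$Path(x, y, x', y') \leftarrow \varphi_{cone}(S, x, y, x', y')$
$Obstructed(x, y, x', y') \leftarrow \neg S(\bar{x}, \bar{y}),\ \bar{x} = a_1 t + b_1,\ \bar{y} = a_2 t + b_2,\ 0 \leq t,\ t \leq 1,\ b_1 = x,\ b_2 = y,\ a_1 + b_1 = x',\ a_2 + b_2 = y'$
$Path(x, y, x', y') \leftarrow \neg Obstructed(x, y, x', y')$
$Path(x, y, x', y') \leftarrow Path(x, y, x'', y''),\ Path(x'', y'', x', y')$
$Disconnected \leftarrow S(x, y),\ S(x', y'),\ \neg Path(x, y, x', y')$
$Connected \leftarrow \neg Disconnected.$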



Fig. 3. A program in spatial Datalog with stratified negation for topological connec- 
tivity of compact databases. 



The first rule is actually an abbreviation of a spatial Datalog program that
computes an FO+Poly formula $\varphi_{cone}(S, x, y, x', y')$ that adds to the relation
Path all pairs of points $((x, y), (x', y')) \in S \times S$ such that $(x', y')$ is within
distance $\varepsilon_{(x,y)}$ of $(x, y)$, where $\varepsilon_{(x,y)}$ is such that $S$ is conical
(in the sense of Property 1) in $B^2((x, y), \varepsilon_{(x,y)})$. We will make the
description of $\varphi_{cone}(S, x, y, x', y')$ more precise below. Then all pairs of
points of the spatial database that can be connected by a straight line segment
completely contained in the database are added to the relation Path. Next, the
transitive closure of the relation Path is computed, and in the final two rules of
the program of Figure 3 it is tested
whether the relation Path contains all pairs of points of the input database.
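
The recursive part of the program has a familiar finite analogue, sketched below. This is not spatial Datalog: it assumes a finite set of points and a finite relation of pairs instead of an infinite, constraint-represented relation, and the function names are hypothetical. It only illustrates the bottom-up evaluation of the transitive-closure rule followed by the Disconnected and Connected rules.

```python
def transitive_closure(path):
    """Bottom-up evaluation of Path(a, d) <- Path(a, b), Path(b, d) on a finite relation."""
    closure = set(path)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure  # fixpoint reached: no rule application adds a pair
        closure |= new_pairs

def connected(points, path):
    """Connected <- not Disconnected, where Disconnected holds if some pair is missing."""
    closure = transitive_closure(path)
    disconnected = any((p, q) not in closure for p in points for q in points)
    return not disconnected

# Hypothetical finite input: a reflexive, symmetric relation on three points.
points = {1, 2, 3}
base = {(p, p) for p in points} | {(1, 2), (2, 1), (2, 3), (3, 2)}
```

On finite inputs the loop always reaches a fixpoint; over the reals this is exactly what has to be established separately for the program of Figure 3.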

Variations of the last two rules in the program of Figure 3 can then be used
to formulate several connectivity queries (e.g., the connectivity test or the
computation of the connected component of a given point $p$ in the input database).

The description of $\varphi_{cone}(S, x, y, x', y')$ in FO+Poly will be clear from the
proof of the following theorem.

Theorem 1. There exists an FO+Poly formula that returns for a given spatial
database $A$ and a given point $p$ a radius $\varepsilon_p$ such that the database $A$ is
conical within $B^2(p, \varepsilon_p)$ (in the sense of Property 1).

Proof. (Sketch) Let $A$ be a spatial database and $p$ be a point. If $p$ is an
interior point of $A$, this is trivial. Assume that $p$ is not an interior point of
$A$. We compute within the disk $B^2(p, 1)$ the set $\gamma_{A,p}$ in FO+Poly. For each
$\varepsilon < 1$, $S^1(p, \varepsilon) \cap A$ is the disjoint union of a finite number of
intervals and points. We then define $\gamma_{A,p} \cap S^1(p, \varepsilon)$ to consist of
these points and the midpoints of these intervals. For the database $A$ of
Figure 4 (a), $\gamma_{A,p}$ is shown in (b) of that figure in full lines and
$\gamma_{A^c,p}$ is shown in dotted lines. These sets can be defined in FO+Poly using
the predicate $Between_{p,\varepsilon}(x', y', x_1, y_1, x_2, y_2)$.
$Between_{p,\varepsilon}(x', y', x_1, y_1, x_2, y_2)$ expresses for points $(x', y')$,
$(x_1, y_1)$ and $(x_2, y_2)$ on $S^1(p, \varepsilon)$ that $(x', y')$ is equal to
$(x_1, y_1)$ or $(x_2, y_2)$ or is located between the clockwise ordered pair of
points $((x_1, y_1), (x_2, y_2))$ of $S^1(p, \varepsilon)$ (for a detailed
description of the expression of this relation in FO+Poly we refer to [12]).





Fig. 4. The 1-environment of a database around $p$ (a) and the construction to determine
an $\varepsilon_p$-environment (smaller dashed circle) in which the database is conical (b).
In (b), $\gamma_{A,p}$ is given in full lines and $\gamma_{A^c,p}$ in dotted lines.



Next, the singular points of $\gamma_{A,p} \cup \gamma_{A^c,p}$ can be found in FO+Poly. Let
$d$ be the minimal distance between $p$ and these singular points. Any radius strictly
smaller than $d$, e.g., $\varepsilon_p = d/2$, will satisfy the condition of the statement
of this Theorem.

Then $B^2(p, \varepsilon_p) \cap (\gamma_{A,p} \cup \gamma_{A^c,p})$ consists of a finite number
of non-intersecting (simple Jordan) curves starting in points of
$S^1(p, \varepsilon_p) \cap (\gamma_{A,p} \cup \gamma_{A^c,p})$ and ending in $p$, each of which
has, for every $\varepsilon < \varepsilon_p$, a single intersection point with
$S^1(p, \varepsilon)$. It is easy (but tedious) to show that there is an isotopy that
brings $B^2(p, \varepsilon_p) \cap (\gamma_{A,p} \cup \gamma_{A^c,p})$ to the cone with top $p$
and base $S^1(p, \varepsilon_p) \cap (\gamma_{A,p} \cup \gamma_{A^c,p})$. This is
the isotopy we are looking for. □

3.2 An FO+Poly+While program for connectivity of arbitrary
spatial databases

For compact spatial databases, all sectors of a boundary point $p$ are in the
same connected component of the database (because a boundary point is always
part of the database). Therefore all pairs of points in an $\varepsilon_p$-environment of $p$
can be added to the relation Path, even if they are in different sectors of $p$. For
arbitrary databases, the sectors of a point $p \in \partial S \setminus S$³ are not
necessarily in the same connected component of the database.⁴ This means that in
general only pairs of points can be added to the relation Path if they are in the same
sector of a point. We can achieve this by iteratively processing all sectors of the
border points and adding only pairs of points that are in the same sector of a
border point to the relation Path. For this iterative process we use the language
FO+Poly+While. The resulting program is shown in Figure 5.

As in the compact case, we first initialize a relation Path and end with 
computing the transitive closure of this relation. 

In the initialization part of the program, first all pairs of points which can be
connected by a straight line segment lying entirely in the database are added
to the relation Path. Then, a 5-ary relation Current is maintained (actually
destroyed) by an iteration that, as we will show, will terminate when the relation
Current has become empty. During each iteration step the relation Path
is augmented with, for each border point $p$ of the database, all pairs of points
on the midcurve of the sector of $p$ that is being processed during the current
iteration.

The remainder of this section is devoted to the description of the implementa- 
tions in FO+Poly of the algorithms INIT, SeRA (sector removal algorithm) and 
CSA (curve selection algorithm) of Figure 5. The correctness and termination 
of the resulting program will be proved in the next section. 

The relation Current will at all times contain tuples $(x_p, y_p, \varepsilon, x, y)$ where
$(x_p, y_p)$ ranges over the border points of the input database $A$, where
$\varepsilon$ is a radius and where $(x, y)$ belongs to a set containing the part of the
$\varepsilon$-environment of $(x_p, y_p)$ that still has to be processed. Initially,
INIT$(S, x_p, y_p, \varepsilon_p, x, y)$ sets the relation Current to the set of five-tuples

$\{(x_p, y_p, \varepsilon_p, x, y) \mid (x_p, y_p) \in \partial S,\ \varepsilon_p = 1,\ (x, y) \in B^2((x_p, y_p), \varepsilon_p) \cap S\}$.

It is clear that INIT can be defined in FO+Poly.

Next, for all border points $p = (x_p, y_p)$ of the database, both in SeRA and
CSA the “first sector” of $p$ in the relation Current$(x_p, y_p, \varepsilon_p, x, y)$ will be
determined. This is implemented as follows. We distinguish between a sector that is
a curve and a fully two-dimensional one. We look at the latter case (the former
is similar).

³ We denote the topological border of $S$ by $\partial S$.

⁴ The same is true for the point at infinity, which can be considered as a boundary
point of the database that does not belong to the database. To improve readability,
we only consider bounded inputs in this section.




Expressing Topological Connectivity of Spatial Databases 






Path := $\{(x, y, x', y') \mid S(x, y) \wedge S(x', y') \wedge \overline{(x, y)(x', y')} \subseteq S\}$
Current := INIT$(S, x_p, y_p, \varepsilon, x, y)$
while Current $\neq \emptyset$ do
    Current := SeRA(Current$(x_p, y_p, \varepsilon, x, y)$)
    Path := CSA(Current$(x_p, y_p, \varepsilon, u, v)$, $x, y, x', y'$)
od
Y := $\emptyset$
while Y $\neq$ Path do
    Y := Path;
    Path := Path $\cup\ \{(x, y, x', y') \mid (\exists x'')(\exists y'')(Path(x, y, x'', y'') \wedge Path(x'', y'', x', y'))\}$
od

Fig. 5. An FO+Poly+While program for topological connectivity of arbitrary
databases. The notation $\overline{(x, y)(x', y')}$ stands for the line segment between
the points $(x, y)$ and $(x', y')$.



We define an order on the circle $S^1(p, \varepsilon)$ with $0 < \varepsilon < \varepsilon_p$, by
using the relation $Between_{p,\varepsilon}(x', y', x_1, y_1, x_2, y_2)$, and by taking the
point $p + (0, \varepsilon)$ as a starting point (see proof of Theorem 1). For each
$0 < \varepsilon < \varepsilon_p$, the intersection of the “first fully two-dimensional sector”
with $S^1(p, \varepsilon)$ is defined as the first (using the just defined order) open
interval on this circle. This is clearly dependent on the radius $\varepsilon$. For the
database of Figure 6 (a) this dependency is illustrated in (b) of that figure. The
“first sector” falls apart into four parts (shaded dark), depending on the radius
$\varepsilon$. Furthermore, as in Theorem 1, the first midcurve, i.e.
the midcurve of the “first sector”, within radius $\varepsilon_p$ is computed in FO+Poly
(the thick curve segments in Figure 6 (b)). By our definition of the “first sector”,
this first midcurve need not be connected. Hence, we obviously do not want
to add all pairs $(q, q')$ of points in this set to the relation Path.

We can, however, compute a new (and smaller) $\varepsilon_p^{new}$ such that the
curve of midpoints no longer has singular points within $B^2(p, \varepsilon_p^{new})$.
In Figure 6 (b), the small dashed circle around $p$ has radius
$\varepsilon_p^{new}$. Within the radius $\varepsilon_p^{new}$ the midcurve is
connected and the point $p$ belongs to its closure.

SeRA now updates $\varepsilon_p$ in the relation Current to $\varepsilon_p^{new}$ and
removes the first sector from the relation Current. This means that the set of
points $(x, y)$ that are in the relation Current with the point $p = (x_p, y_p)$
will initially be $B^2(p, \varepsilon_p) \cap S$, then $B^2(p, \varepsilon_p^{new}) \cap S$
minus the first sector of $p$ (after the first iteration), then
$B^2(p, \varepsilon_p^{new}) \cap S$ minus the first two sectors of $p$ (after the
second iteration), etc.

CSA will add to the relation Path all pairs $(q, q')$ of midpoints at different
distances $\varepsilon, \varepsilon' < \varepsilon_p^{new}$ from $p$ ($\varepsilon$ can be
taken 0, if $p$ belongs to the database) of the sector that has just been
removed by SeRA.
Fig. 6. The $\varepsilon_p$-environment of the point $p$ in (a) and the “first
sector” of $p$, the midcurve of the “first sector” and $\varepsilon_p^{new}$ in (b).



4 Correctness and termination of the programs 

In this section, we prove the correctness and termination of both programs of 
the previous section. 

Theorem 2. The spatial Datalog program of the previous section correctly tests
connectivity of compact spatial databases and the FO+Poly+While program
of the previous section correctly tests connectivity of arbitrary spatial
databases. In particular, the spatial Datalog program is guaranteed to
terminate on compact input databases and the FO+Poly+While program terminates
on all input databases.

Proof. (Sketch) To prove correctness, we first have to verify that for every
input database $S$ our programs are sound (i.e., two points in $S$ are in the
same connected component of $S$ if and only if they are in the relation Path).
Secondly, we have to determine the termination of our programs (i.e., we have
to show that the first while-loop in Figure 5 that initializes the relation
Path ends after a finite number of steps and that for both programs the
computation of the transitive closure of the relation Path ends). To prove the
latter it is sufficient that we show that there exists a bound $\alpha(S)$ such
that any two points in the same connected component of $S$ end up in the
relation Path after at most $\alpha(S)$ iterations of the computation of the
transitive closure. To improve readability, we only consider bounded inputs of
the FO+Poly+While program in this proof.

Soundness. The if-implication of soundness (cf. supra) is trivial. So, we
concentrate on the only-if implication. We use Collins’s Cylindrical Algebraic
Decomposition (CAD) [5] to establish the only-if direction. This decomposition
returns, for a polynomial constraint description of $S$, a decomposition $c(S)$
of the plane in a finite number of cells. Each cell is either a point, a
1-dimensional curve (without endpoints), or a two-dimensional open region.
Moreover, every cell is either part of $S$ or of the complement of $S$. In order
to prove that any two points in the same connected component of $S$ are in the
transitive closure of the relation Path, it is sufficient to prove this for

1. any two points of S that are in one cell of c(S), in particular, 

a. two points of S that are in the same region, 

b. two points of S that are on the same curve, and 

2. any two points of S that are in adjacent cells of c(S), in particular,

a. a point that is in a region and a point that is on an adjacent curve, 

b. a point that is in a region and an adjacent point, 

c. a point that is on a curve and an adjacent point. 

1.a. In this case the two points $p$ and $q$ are part of a region in the
interior of $S$, so they can be connected by a semi-algebraic curve $\gamma$
lying entirely in the interior of $S$ [3]. Since uniformly continuous curves
(such as semi-algebraic ones) can be arbitrarily closely approximated by a
piecewise linear curve with the same endpoints [15], $p$ and $q$ can be
connected by a piecewise linear curve lying entirely in the interior of $S$,
and we are done.

1.b. The curves in the decomposition are either part of the boundary of $S$ or
vertical lines belonging to $S$. In the latter case, the vertical line itself
connects the two points. For the former case, let $p$ and $q$ be points on a
curve $\gamma$ in the cell decomposition. We prove for the case of the
FO+Poly+While program that $p$ and $q$ are in the transitive closure of the
relation Path. For the spatial Datalog program the proof is analogous. Let
$\gamma_{pq}$ be the curve segment of $\gamma$ between $p$ and $q$. Since all
points $r$ on $\gamma_{pq}$ belong to the border of $S$, the algorithm SeRA
processes the curve $\gamma$ twice as sectors of $r$. We cover $\gamma_{pq}$ with
disks $B^2(r, \varepsilon_r)$, where $\varepsilon_r$ is the radius constructed by
SeRA when processing $\gamma$ as a sector of $r$. Since $\gamma_{pq}$ is a compact
curve this covering has a finite sub-covering of, say, $m$ closed balls. Then,
the points $p$ and $q$ are in the relation Path after $m$ iterations in the
computation of the transitive closure.

2.a. A point on a vertical border line of a region can be connected by one
single horizontal line with a point in the adjacent region. Hence, this case
reduces to Case 1.a. If the point is on a non-vertical boundary curve of $S$,
the sector around that point intersecting the adjacent region contains a
midcurve connecting the point to the interior of the adjacent region (in the
case of the spatial Datalog program even more pairs are added). Again this case
reduces to Case 1.a.

2.b. In this case there is a midcurve from $p$ into the two-dimensional sector
intersecting the region cell in $c(S)$. We distinguish between two cases. These
two cases are depicted in Figure 7. In (a) the midcurve intersects the cell,
while in (b) this is not the case. In Case (a), point $p$ is connected by this
midcurve to the cell, hence this case reduces to Case 1.a. For Case (b), we let
$r$ be a midpoint of a curve computed by SeRA belonging to the connected
component of the interior of $S$ that contains $q$. Hence, after using a similar
argument as in Case 1.a, $p$ and $q$ belong to the transitive closure of the
relation Path via a curve that passes
Fig. 7. The two cases in 2.b.



through r. the vertical line through p adjacent to the region. 

2.c. There are various cases to be distinguished here. A vertical curve can be
dealt with as before. A non-vertical curve is either a curve belonging to the
border of $S$ or a curve belonging to the interior of $S$. The latter case can
be dealt with as in Case 2.b. For the former case, the algorithm SeRA will add
$p$ and a point of the border curve to the relation Path. The desired path to
the border point can be found as in Case 1.b.

Termination. The first while-loop of the program in Figure 5 terminates since 
every border point of a spatial database has only a finite number of sectors in 
its cone and furthermore this number is bounded (this follows immediately from 
Property 2). After a finite number of runs of SeRA, the relation Current will 
therefore become empty. 

To prove the termination of the computation of the transitive closure of the
relation Path in both programs, we return to Collins’s CAD. From the soundness
proof it is clear that it is sufficient to show that there exists an upper
bound $\alpha(c)$ on the number of iterations of the transitive closure to
connect two points in a cell $c$ of $c(S)$.

We now show that for each region (i.e., two-dimensional cell) $c$ in the CAD,
there is a transversal $\gamma(c)$ in the relation Path from the bottom left
corner of $c$ to the upper right corner of $c$ of finite length $\beta(c)$. Any
two points of $c$ can then be connected by at most $\alpha(c) = \beta(c) + 2$
iterations of the transitive closure of the relation Path (namely by vertically
connecting to the transversal and following it). For this we remark that the
bottom left corner point of the cell $c$ can be connected by a finite and fixed
number of steps (see proof of soundness) with a point $p$ in the interior of
$c$. Similarly, the upper right corner point of $c$ can be connected to some
interior point $q$ of $c$. The points $p$ and $q$ can be connected by a
piecewise linear curve, consisting of $\beta(c)$ line segments. This gives the
desired transversal $\gamma(c)$. For cells $c$ which are vertical line segments
or single points the upper bound is 1. Remark that points on a one-dimensional
cell $c$ can also be connected by a finite number $\alpha(c)$ of line segments.
This follows from the compactness of the curves (see Case 1.b of the soundness
proof). □
5 Discussion

It is not clear whether the first while-loop of the FO+Poly+While program of
Figure 5, which initializes the Path relation, can be expressed in spatial
Datalog with stratified negation. More generally, we can wonder about the
following.

Question: Can spatial Datalog with stratified negation express all computable
spatial database queries?
Abstract. Linear constraint databases and query languages are appropriate for
spatial database applications. Not only is the data model natural for
representing a large portion of spatial data, such as in GIS systems, but there
also exist efficient algorithms for the core operations in the query languages.
However, an important limitation of the linear constraint data model is that it
cannot model constructs such as “Euclidean distance.” A previous attempt to
extend linear constraint languages with the ability to express Euclidean
distance, by Kuijpers, Kuper, Paredaens, and Vandeurzen, was to adapt two
fundamental Euclidean constructions with ruler and compass in a first-order
logic over points. The language, however, requires the input database to be
encoded in an ad hoc LPC representation so that the logic operations can apply.
This causes the problem that queries in their language may sometimes depend on
the encoding and thus do not have any natural meaning. In this paper, we
propose an alternative approach and develop an algebraic language in which the
traditional operators and Euclidean constructions work directly on the data
represented by “semi-circular” constraints. By avoiding the encoding step, our
language does not suffer from this problem. We show that the language is closed
under these operations.



1 Introduction 

First-order logic with linear constraints (FO+lin) has turned out to be an 
appropriate language for expressing queries over spatial data, as in GIS sys- 
tems, for example. There are, however, certain limitations on the expressive 
power of FO+lin. Some of these limitations are inherent to first-order lan- 
guages in general, including the fact that connectivity cannot be expressed, 
and the tradeoffs between expressive power and efficiency in such cases have 
been well studied [BDLW98,PVV98,GS97]. There are, however, additional 
limitations that are a result not of the language itself, but rather of the class 
of linear constraints. The most significant of these restrictions is the inability 
to express the notion of Euclidean distance. 

If we were to consider a query language with polynomial constraints
(FO+poly), we would clearly be able to express such queries, but such a
language would be far too powerful for our purposes and would be more difficult
to implement. Although such a language is theoretically feasible, practical
algorithms for efficient implementation of database systems with FO+poly are
still a research issue. A natural question is to ask whether there is a language
between FO+lin and FO+poly with additional expressive power, but without
the full power of FO+poly. The naive approach, restricting our attention to
quadratic constraints, does not work: the requirement that the language be
closed enables one to write queries whose results require higher-order
polynomials. In addition, adding some geometric primitives, such as collinearity,
to a first-order language again yields the full power of FO+poly. A more
successful approach is the PFOL language of [VGV98]. This language enables
one to express queries on finite databases that use Euclidean distance.
However, as long as one restricts the attention to databases in FO+lin, one will
still not be able to deal with queries such as “return all the points within a
given distance,” over finitely representable databases.

In DBPL ’97, a first attempt was made at a different approach to this
problem [KKPV97]. The key observation used there is that the two relevant
concepts, linear constraints and Euclidean distance, correspond to the two
basic operations of Euclidean geometry: constructions with ruler and compass.
[KKPV97] defines a query language #circ that is situated strictly between
FO+lin and FO+poly and that expresses those queries that can be described
as ruler-and-compass constructions on the input database.

The original idea in the work of Kuijpers et al. [KKPV97] had been to use
lines and circles as primitive objects, and define operations on them. This
did not work, as Euclidean geometry provided a clear intuition for what to
do with these lines and circles, but not about interiors of regions. In other
words, there was no natural way to define operations on interiors of regions
that were naturally derived from operations on their boundaries. For
this reason, #circ applied to an encoding of objects as tuples of points. Using
this encoding, #circ had the desired properties.

Since objects in constraint databases are not encoded, a #circ query
consists of three parts: an encoding step, that maps a flat relation to its
encoding, the “real” query, that works on this encoding, and a decoding step.
The semantics of the query language depends on a specific, but arbitrary,
encoding and this causes certain problems in #circ. Indeed the query language
allows queries with no natural meaning (“return the object whose representative
is closest to the origin”) to be expressed.

In this paper we propose a different approach in which the data is
represented directly as spatial objects. The model is reminiscent of the nested
extension [GS95] of the standard constraint model. The purpose is to avoid the
need to use an encoding to refer to distinct geometrical objects. Our main
contribution is to provide a natural extension of standard Euclidean
operations to interiors of regions. We generalize the notion of drawing a line
(or circle) between 2 points to that of drawing a line between 2 objects. The
idea is that, given two objects, we take the union of all the lines that go
through pairs of points from the two given objects. (A similar idea, drawing
lines through the origin, was used in [VGV98], but only for linear databases.)
This may appear to give no additional power: as we shall see, the result can
always be described by taking the boundary of the objects, drawing the
appropriate lines between boundary points, and taking the interior of the
result. Our direct approach, however, has the advantage of establishing a
direct, natural connection between the original objects and the result of the
operation, thus eliminating the need for auxiliary information to specify which
interiors are in the database.

In the next section we give basic definitions, and the following section
defines the Euclidean operations on regions and the EuAlg language based
on them. We then prove that the language is closed, and conclude with related
work and directions for future research.



2 Basic Notions 

We consider spatial databases in the plane. In order to accommodate
Euclidean operations, these must be over a subfield $\mathbb{D}$ of $\mathbb{R}$
that is closed under square roots. Most of our results apply to any such field.
The minimal such field is known as the field of constructible numbers. As in
[KKPV97], we consider sets of points that can be described by lines and circles.
These are called semi-circular sets.

Definition. A subset of $\mathbb{D}^2$ is called a semi-circular set iff it can
be defined as a Boolean combination of sets of the form

$$\{(x, y) \mid ax + by + c \;\theta\; 0\}$$

or

$$\{(x, y) \mid (x - a)^2 + (y - b)^2 \;\theta\; c^2\},$$

where $a$, $b$, and $c$ are in the domain $\mathbb{D}$ and $\theta$ is in
$\{\le, <, =, >, \ge\}$. Let $P$ be the set of all semi-circular sets over
$\mathbb{D}^2$.
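As a small illustration of this definition, a semi-circular set can be modelled as a membership predicate built from line and circle constraints and closed under Boolean combinations. This is only a sketch under stated assumptions: the function names are hypothetical, and rational coordinates (Python `Fraction` or `int`) stand in for the field $\mathbb{D}$, which in the paper is also closed under square roots.

```python
import operator
from fractions import Fraction

# The five comparison symbols allowed for theta.
OPS = {"<=": operator.le, "<": operator.lt, "=": operator.eq,
       ">": operator.gt, ">=": operator.ge}


def line(a, b, c, theta):
    """The set {(x, y) | a*x + b*y + c  theta  0} as a predicate."""
    return lambda x, y: OPS[theta](a * x + b * y + c, 0)


def circle(a, b, c, theta):
    """The set {(x, y) | (x-a)^2 + (y-b)^2  theta  c^2} as a predicate."""
    return lambda x, y: OPS[theta]((x - a) ** 2 + (y - b) ** 2, c ** 2)


def union(*sets):
    return lambda x, y: any(s(x, y) for s in sets)


def intersection(*sets):
    return lambda x, y: all(s(x, y) for s in sets)


def complement(s):
    return lambda x, y: not s(x, y)


# Open left half of the open unit disk: x^2 + y^2 < 1 and x < 0.
half_disk = intersection(circle(0, 0, 1, "<"), line(1, 0, 0, "<"))
inside = half_disk(Fraction(-1, 2), Fraction(0))
```

A predicate only answers point queries; the closure and decidability results in the text rely on the symbolic constraint representation, which this sketch does not attempt to reproduce.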

Definition. A Euclidean constraint is an equation in one of the following
two forms:

$$ax + by + c = 0$$

or

$$(x - a)^2 + (y - b)^2 = c^2,$$





Gabriel M. Kuper and Jianwen Su 



with $a$, $b$, and $c$ in the domain $\mathbb{D}$.

A semi-circular set is called rectangular if it can be represented by a
formula of the form $a < x < b \wedge c < y < d$, with $a, b, c, d \in \mathbb{D}$.

Definition. Let $r$ be a semi-circular set.

1. The boundary of $r$ is the set of all points $p$ in $\mathbb{D}^2$ such that
every non-empty rectangular set that contains $p$ contains both points in $r$
and points not in $r$.

2. A side of $r$ is a maximal set of all those boundary points that satisfy a
single Euclidean constraint.

3. A point $p$ in $r$ is an isolated point of $r$ if there is a non-empty
rectangular set that contains $p$, but contains no other point of $r$.

4. A point $p$ in $\mathbb{D}^2$ is a corner of a region $r$ if $p$ is either
(1) an isolated point of $r$, or (2) a boundary point of $r$ that is a member
of at least two sides of $r$.

Note that the notions of sides and corners are defined in such a way as
to generalize the intuitive notion of a “side” of a semi-linear set to include
segments of circles as well as straight lines. It is straightforward to show (1)
that any semi-circular set has a finite number of sides, (2) that each side of
a semi-circular set is itself a semi-circular set, and (3) that the boundary of
a semi-circular set is also a semi-circular set. Note that the definitions apply
to unbounded sets as well; in particular, $\mathbb{D}^2$ has no sides and has
empty boundary.



3 The EuAlg Language

We first define the data model. The basic types are 2-dimensional
semi-circular sets. We shall use the term semi-circular relation for relations
over these types (our terminology here is different from that used in [KKPV97],
which uses a flat model, where relations are simply semi-circular sets).

Definition. A semi-circular $n$-tuple is a tuple $t = (t_1, \ldots, t_n)$, where
each $t_i$ is a semi-circular set. Two tuples $t$ and $t'$ are equivalent if the
semi-circular sets represented by $t_1, \ldots, t_n$ are equal to the
semi-circular sets represented by $t'_1, \ldots, t'_n$, respectively. A
semi-circular relation $R$ of arity $n$ is a finite set of semi-circular
$n$-tuples.

Equivalence and containment of semi-circular relations can now be defined
in a natural way; these are decidable, which follows from the decidability of
the theory of real closed fields. Note that equivalence of relations differs from
equivalence in the sense of [KKR95], as the current model is a nested one. In
the current paper, we ignore non-spatial (thematic) attributes, though these
can be easily added to the model.
EuAlg is a nested algebraic query language, but with only one level of
nesting (this is just to provide names for spatial objects). The nested model
is itself similar to the ones used in [BBC97], as well as the Dedale [GRS98] and
Cosmos [KRSS98] prototypes.

We now turn to the spatial primitives. The novel ones are extensions of the
Euclidean primitives for drawing lines and circles to handle regions. We start
with lines. The intuition is that the generalization of the notion of drawing a
line between two points, to that of drawing lines between two regions, is to
take the union of all lines that go through any point in the first region and
any point in the second. For technical reasons, we actually use rays, rather
than lines, a ray from $p_1$ to $p_2$ being a “half-line,” starting at $p_1$, and
going through $p_2$ (a line is then easily definable as the union of two rays).

How do we handle circles? A circle in [KKPV97] is represented by a triple
$(p_1, p_2, p_3)$, where $p_1$ is the center, and $d(p_2, p_3)$ is the radius. We
generalize this directly to semi-circular sets, by taking the union of all
circles with center in the first set, and radius equal to the distance between
a point in the second and one in the third. (Alternative approaches are
discussed in Section 5.)

Definition. Let $p_1$, $p_2$, and $p_3$ be points in $\mathbb{D}^2$.

1. $\mathit{ray}(p_1, p_2)$ is the half-line that starts at $p_1$ (including
$p_1$ itself) and goes through the point $p_2$.

2. $\mathit{circ}(p_1, p_2, p_3)$ is the circle with center $p_1$, and radius
equal to $d(p_2, p_3)$.



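Definition. Let $r_1$, $r_2$, and $r_3$ be semi-circular sets.

1. $\mathrm{RAY}(r_1, r_2) = \bigcup_{p_1 \in r_1,\, p_2 \in r_2} \mathit{ray}(p_1, p_2)$.

2. $\mathrm{CIRC}(r_1, r_2, r_3) = \bigcup_{p_1 \in r_1,\, p_2 \in r_2,\, p_3 \in r_3} \mathit{circ}(p_1, p_2, p_3)$.

3. $\mathrm{BDR}(r_1) = \{p \mid p \text{ is a boundary point of } r_1\}$.

4. $\mathrm{SIDES}(r_1) = \{r' \mid r' \text{ is a side of } r_1\}$.

5. $\mathrm{CORNERS}(r_1) = \{p \mid p \text{ is a corner of } r_1\}$.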
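The point-level primitives ray and circ, and the way RAY lifts ray to sets by taking a union, can be sketched in Python as follows. The sketch rests on assumptions that are not part of the paper: regions are replaced by finite sets of points with integer or rational coordinates, pairs with $p_1 = p_2$ are skipped because the text does not define a ray for them, and all function names are hypothetical.

```python
def on_ray(p1, p2, q):
    """True if q lies on the half-line from p1 through p2 (requires p1 != p2)."""
    (x1, y1), (x2, y2), (x, y) = p1, p2, q
    dx, dy = x2 - x1, y2 - y1
    cross = dx * (y - y1) - dy * (x - x1)  # zero iff q is on the line p1 p2
    dot = dx * (x - x1) + dy * (y - y1)    # >= 0 iff q is on p2's side of p1
    return cross == 0 and dot >= 0


def on_circ(p1, p2, p3, q):
    """True if q lies on the circle with centre p1 and radius d(p2, p3)."""
    def sq(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return sq(p1, q) == sq(p2, p3)  # compare squared distances exactly


def in_RAY(r1, r2, q):
    """Membership of q in RAY(r1, r2) for finite point sets r1 and r2."""
    return any(on_ray(p1, p2, q) for p1 in r1 for p2 in r2 if p1 != p2)


def in_CIRC(r1, r2, r3, q):
    """Membership of q in CIRC(r1, r2, r3) for finite point sets."""
    return any(on_circ(p1, p2, p3, q)
               for p1 in r1 for p2 in r2 for p3 in r3)


origin = {(0, 0)}
targets = {(1, 0), (0, 1)}
hit = in_RAY(origin, targets, (5, 0))    # on the ray through (1, 0)
miss = in_RAY(origin, targets, (0, -2))  # behind the start point
```

Since `in_RAY` is an existential test over pairs of points, it distributes over unions of its arguments, which is the property the closure proof later calls monotonicity.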



Example 1. Consider the two regions ri and r 2 , where 
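Fig. 1. Ray and circle drawing on two rectangles (panels: $\mathrm{RAY}(r_1, r_2)$
and $\mathrm{CIRC}(r_1, r_2, r_2)$).

Then

$$
\begin{aligned}
\mathrm{RAY}(r_1, r_2) &= s_1 = (0 < 2y - x < 3 \wedge x > 1 \wedge y > 1) \vee (y - 2x < -3 \wedge y > 2)\\
\mathrm{RAY}(r_1, r_1) &= s_2 = \mathbb{D}^2\\
\mathrm{CIRC}(r_1, r_2, r_2) &= s_3 = (1 - \sqrt{2} < x < 2 + \sqrt{2} \wedge 1 < y < 2)\\
&\qquad \vee\ (1 < x < 2 \wedge 1 - \sqrt{2} < y < 2 + \sqrt{2})\\
&\qquad \vee\ ((x-1)^2 + (y-1)^2 < 2) \vee ((x-1)^2 + (y-2)^2 < 2)\\
&\qquad \vee\ ((x-2)^2 + (y-1)^2 < 2) \vee ((x-2)^2 + (y-2)^2 < 2)\\
\mathrm{CIRC}(r_1, r_1, r_2) &= s_4 = (1 - \sqrt{13} < x < 2 + \sqrt{13} \wedge 1 < y < 2)\\
&\qquad \vee\ (1 < x < 2 \wedge 1 - \sqrt{13} < y < 2 + \sqrt{13})\\
&\qquad \vee\ ((x-1)^2 + (y-1)^2 < 13) \vee ((x-1)^2 + (y-2)^2 < 13)\\
&\qquad \vee\ ((x-2)^2 + (y-1)^2 < 13) \vee ((x-2)^2 + (y-2)^2 < 13)\\
\mathrm{SIDES}(r_1) &= \{r_{1,1}, r_{1,2}, r_{1,3}, r_{1,4}\}\\
&\qquad \text{where } r_{1,1} = (x = 1 \wedge 1 \le y \le 2),\quad r_{1,2} = (x = 2 \wedge 1 \le y \le 2),\\
&\qquad \phantom{\text{where }} r_{1,3} = (1 \le x \le 2 \wedge y = 1),\quad r_{1,4} = (1 \le x \le 2 \wedge y = 2)\\
\mathrm{BDR}(r_1) &= r_{1,1} \vee r_{1,2} \vee r_{1,3} \vee r_{1,4}\\
\mathrm{CORNERS}(r_1) &= \{(1,1), (2,1), (1,2), (2,2)\}
\end{aligned}
$$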



Figure 1 shows the result of $\mathrm{RAY}(r_1, r_2)$ and $\mathrm{CIRC}(r_1, r_2, r_2)$.
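The description of $s_3$ can be sanity-checked numerically. The sketch below assumes that $r_1$ and $r_2$ are the open rectangles $1 < x < 2 \wedge 1 < y < 2$ and $3 < x < 4 \wedge 2 < y < 3$ given in Example 2; under that assumption the radii $d(p_2, p_3)$ with $p_2, p_3 \in r_2$ range over $[0, \sqrt{2})$, so $\mathrm{CIRC}(r_1, r_2, r_2)$ should be the set of points at distance less than $\sqrt{2}$ from the closed unit square $[1,2]^2$. The code compares that distance characterisation with the printed formula on a rational grid, using exact arithmetic. The helper names are hypothetical.

```python
from fractions import Fraction as F


def lt_sqrt(v, n):
    """Exact test v < sqrt(n) for rational v and integer n >= 0."""
    return v < 0 or v * v < n


def in_s3_formula(x, y):
    """Membership in s3 as written: two open rectangles and four open disks."""
    # 1 - sqrt(2) < x < 2 + sqrt(2)  <=>  1 - x < sqrt(2) and x - 2 < sqrt(2)
    wide = lt_sqrt(1 - x, 2) and lt_sqrt(x - 2, 2) and 1 < y < 2
    tall = 1 < x < 2 and lt_sqrt(1 - y, 2) and lt_sqrt(y - 2, 2)
    disks = any((x - a) ** 2 + (y - b) ** 2 < 2
                for a in (1, 2) for b in (1, 2))
    return wide or tall or disks


def in_s3_distance(x, y):
    """Points whose squared distance to the closed square [1,2]^2 is < 2."""
    dx = max(1 - x, 0, x - 2)
    dy = max(1 - y, 0, y - 2)
    return dx * dx + dy * dy < 2


grid = [F(i, 4) for i in range(-8, 21)]  # -2, -7/4, ..., 5
mismatches = [(x, y) for x in grid for y in grid
              if in_s3_formula(x, y) != in_s3_distance(x, y)]
```

A grid check of this kind is evidence rather than a proof, and it says nothing about points off the grid.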



The EuAlg algebra is standard relational algebra, together with special
Euclidean operators. We start with the standard part:

Definition. 



1. $r \cup s$ is the union of the relations $r$ and $s$, i.e., the union of the
sets of tuples in both relations with duplicates (i.e., equivalent tuples)
eliminated.

2. $r \cap s$ is the set of those tuples in $r$ that are equivalent to some
tuple in $s$.

3. $r - s$ is the set of those tuples in $r$ that are not equivalent to any
tuple in $s$.

4. $r \times s$ is the Cartesian product of $r$ and $s$.

5. $\sigma_F(r)$, where $F$ is one of $i = j$, $i \subseteq j$, $i \cap j = \emptyset$
(where $r$ has arity $n$, and $i, j \le n$), is the set of those tuples in $r$
for which the sets in columns $i$ and $j$ satisfy $F$.

6. $\pi_X r$ is the set of all tuples of the form $\pi_X(t)$ for $t \in r$.

We now turn to the Euclidean operators. These operators have a certain
resemblance to aggregate operators or functions in the relational model, in
that each operator adds an additional column to the relation, whose value
depends on the values of the other columns of each tuple. In the following
definition, $r$ will be a relation of arity $n$, and $i, j, k \le n$. The result
of $\varepsilon_{op}(r)$ will be a relation of arity $n + 1$, defined as follows:

Definition. 

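1. Set operators on the spatial extent:

- Union: $\varepsilon_{i \cup j}(r) = \{(t, t.i \cup t.j) \mid t \in r\}$.

- Intersection: $\varepsilon_{i \cap j}(r) = \{(t, t.i \cap t.j) \mid t \in r\}$.

- Difference: $\varepsilon_{i - j}(r) = \{(t, t.i - t.j) \mid t \in r\}$.

2. $\varepsilon_{\mathrm{RAY}(i,j)}(r) = \{(t, \mathrm{RAY}(t.i, t.j)) \mid t \in r\}$.

3. $\varepsilon_{\mathrm{CIRC}(i,j,k)}(r) = \{(t, \mathrm{CIRC}(t.i, t.j, t.k)) \mid t \in r\}$.

4. $\varepsilon_{\mathrm{BDR}(i)}(r) = \{(t, \mathrm{BDR}(t.i)) \mid t \in r\}$.

5. $\varepsilon_{\mathrm{SIDES}(i)}(r) = \{(t, s) \mid t \in r, s \in \mathrm{SIDES}(t.i)\}$.

6. $\varepsilon_{\mathrm{CORNERS}(i)}(r) = \{(t, s) \mid t \in r, s \in \mathrm{CORNERS}(t.i)\}$.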
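The common shape of these operators, appending one computed column per tuple, can be sketched as a pair of higher-order functions. This is an illustration only: `extend` and `expand` are hypothetical names, and finite `frozenset`s of points stand in for semi-circular sets, so tuple equivalence reduces to ordinary equality.

```python
def extend(relation, op, *cols):
    """Append op(t.i, t.j, ...) as a new last column of every tuple t.

    Column numbers are 1-based, as in the text."""
    return {t + (op(*(t[c - 1] for c in cols)),) for t in relation}


def expand(relation, op, col):
    """Like extend, but op returns a set of objects and each object yields
    its own output tuple (the pattern of the SIDES and CORNERS operators)."""
    return {t + (s,) for t in relation for s in op(t[col - 1])}


# Finite stand-ins for two spatial objects.
a = frozenset({(0, 0), (1, 0)})
b = frozenset({(1, 0), (2, 0)})
r = {(a, b)}

union_12 = extend(r, frozenset.union, 1, 2)          # pattern of the union operator
inter_12 = extend(r, frozenset.intersection, 1, 2)   # pattern of the intersection operator
diff_12 = extend(r, frozenset.difference, 1, 2)      # pattern of the difference operator
points_1 = expand(r, lambda s: {frozenset({p}) for p in s}, 1)
```

The first group of operators produces exactly one output tuple per input tuple, whereas the SIDES and CORNERS pattern can produce several, which is why `expand` is kept separate.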

Finally, the EuAlg language also has two constant relations $\epsilon_o$ and
$\epsilon_u$ (for “origin” and “unit”), that contain the tuples $\{(0, 0)\}$ and
$\{(0, 1)\}$ respectively. The need for 2 fixed points was discussed in
[KKPV97]: these points can be used to simulate choice constructs (“select an
arbitrary point on a line”), that are used in many geometric constructions.

Example 2. Consider the relations

$$r = \{(r_1, r_2), (r_1, r_3)\}$$

and

$$s = \{(r_1, r_2, r_2), (r_1, r_1, r_2)\},$$

where

$$
\begin{aligned}
r_1 &= 1 < x < 2 \wedge 1 < y < 2\\
r_2 &= 3 < x < 4 \wedge 2 < y < 3\\
r_3 &= (y = x + 4)
\end{aligned}
$$

Then

1. $\varepsilon_{\mathrm{RAY}(1,2)}(r) = \{(r_1, r_2, s_1), (r_1, r_3, s_5)\}$, where
$s_1 = \mathrm{RAY}(r_1, r_2)$ was described in Example 1 and $s_5 = (y > x - 1)$.

2. $\varepsilon_{\mathrm{CIRC}(1,2,3)}(s) = \{(r_1, r_2, r_2, s_3), (r_1, r_1, r_2, s_4)\}$,
where $s_3 = \mathrm{CIRC}(r_1, r_2, r_2)$ and $s_4 = \mathrm{CIRC}(r_1, r_1, r_2)$ were
also described in Example 1.

Example 3. We now illustrate how Euclidean constructions can be expressed
in EuAlg. Let $r$ be a binary relation that represents a set of pairs of lines.
More formally, if $(l_1, l_2)$ is a tuple in $r$, then each $l_i$ represents a
line. Suppose that we want to bisect the angles defined by these pairs of
lines, i.e., to compute a relation $s$ such that $(l_1, l_2, l)$ is in $s$ iff
$(l_1, l_2)$ is in $r$ and $l$ is the line bisecting the angle from $l_1$ to
$l_2$. We can express this as follows:

1. Compute the intersection of each pair of lines:

$$r_1 = \varepsilon_{1 \cap 2}(r).$$

2. Draw all circles with centers at these intersection points, and with radius
equal to unity. Then take the intersections of these circles with the original
lines:

$$r_2 = \varepsilon_{2 \cap 6}\, \varepsilon_{1 \cap 6}\, \varepsilon_{\mathrm{CIRC}(3,4,5)}(r_1 \times \epsilon_o \times \epsilon_u).$$

3. Draw the two circles with centers at these intersections, and with radii
equal to the distance between them:

$$r_3 = \varepsilon_{\mathrm{CIRC}(7,7,8)}\, \varepsilon_{\mathrm{CIRC}(8,7,8)}\, r_2.$$

4. Take the intersections of these circles, and then draw the rays through
these points and through the vertex of the angle. The entire line is thus
computed (note that each line is computed twice, but that duplicates are
automatically eliminated). Finally the intermediate results are projected
out:

$$r_4 = \pi_{(1,2,12)}\, \varepsilon_{\mathrm{RAY}(3,11)}\, \varepsilon_{9 \cap 10}\, r_3.$$
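The four steps correspond to the classical ruler-and-compass bisection, which can be traced numerically for a single pair of lines. The sketch below is an illustration under assumptions that differ from the algebra: it uses floating-point arithmetic, represents each line by a point and a direction vector, and follows only one of the intersection points at each step, whereas the EuAlg expression carries all of them as sets. All function names are hypothetical.

```python
import math


def cross(u, v):
    return u[0] * v[1] - u[1] * v[0]


def line_intersection(p, d, q, e):
    """Intersection of the lines p + s*d and q + t*e (assumed non-parallel)."""
    s = cross((q[0] - p[0], q[1] - p[1]), e) / cross(d, e)
    return (p[0] + s * d[0], p[1] + s * d[1])


def unit_point(v, d):
    """One of the two points where the unit circle around v meets the line
    through v with direction d."""
    n = math.hypot(d[0], d[1])
    return (v[0] + d[0] / n, v[1] + d[1] / n)


def circle_intersections(a, b, radius):
    """Both intersection points of two circles of equal radius around a and b."""
    mx, my = (a[0] + b[0]) / 2, (a[1] + b[1]) / 2
    dx, dy = b[0] - a[0], b[1] - a[1]
    dist = math.hypot(dx, dy)
    h = math.sqrt(radius ** 2 - (dist / 2) ** 2)
    nx, ny = -dy / dist, dx / dist  # unit normal to the segment ab
    return (mx + h * nx, my + h * ny), (mx - h * nx, my - h * ny)


def bisector(p, d, q, e):
    """Return the vertex and a direction vector of a bisecting line."""
    v = line_intersection(p, d, q, e)            # step 1
    a, b = unit_point(v, d), unit_point(v, e)    # step 2
    radius = math.dist(a, b)                     # step 3
    c1, _ = circle_intersections(a, b, radius)
    return v, (c1[0] - v[0], c1[1] - v[1])       # step 4


def angle_between_lines(u, w):
    """Acute angle between two undirected lines given by direction vectors."""
    cos = abs(u[0] * w[0] + u[1] * w[1]) / (math.hypot(*u) * math.hypot(*w))
    return math.acos(min(1.0, cos))


# The x-axis and the y-axis, each given by a point and a direction.
d1, d2 = (1.0, 0.0), (0.0, 1.0)
vertex, direction = bisector((2.0, 0.0), d1, (0.0, 3.0), d2)
```

The construction works because both circle intersections lie on the perpendicular bisector of the segment between the two unit points, and that bisector passes through the vertex. In the degenerate case where the lines meet at 60 degrees, one of the two intersection points coincides with the vertex, and the other one would have to be used.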



4 Closure 

Our main result is that EuAlg is closed:

Theorem 1. Let $Q$ be a EuAlg expression, and $r$ a semi-circular relation.
Then $Q(r)$ is also a semi-circular relation.

Proof Sketch:

Note first that closure under the standard, non-Euclidean, operators is
immediate, as is closure under the Euclidean operators $\varepsilon_{\mathrm{BDR}}$,
$\varepsilon_{\mathrm{SIDES}}$, $\varepsilon_{\mathrm{CORNERS}}$, $\varepsilon_{i \cup j}$,
$\varepsilon_{i \cap j}$, and $\varepsilon_{i - j}$. The proof will focus therefore on
the remaining operators, $\varepsilon_{\mathrm{RAY}}$ and $\varepsilon_{\mathrm{CIRC}}$.
We can show:
Lemma 2. If the boundary of a set $r$ is semi-circular, so is $r$. ∎

We now observe that RAY and CIRC are monotone, i.e., for example,
$\mathrm{RAY}(r_1, r_2 \cup r_3) = \mathrm{RAY}(r_1, r_2) \cup \mathrm{RAY}(r_1, r_3)$.
We shall make frequent use of this fact: to start with, we may therefore assume
that all input regions are connected, using an induction argument together
with monotonicity.

Lemma 2 shows that we need only show that the boundary of the output
of an operation is semi-circular. We would like to be able to consider only
the boundary of the input as well, using an identity such as
$\mathrm{BDR}(\mathrm{RAY}(r, s)) = \mathrm{RAY}(\mathrm{BDR}(r), \mathrm{BDR}(s))$.
Unfortunately, this does not hold in general (since the RAY operation will
likely produce regions), but we can show:

Lemma 3. Let $r$ and $s$ be connected, non-empty, semi-circular sets. Then

$$\mathrm{BDR}(\mathrm{RAY}(r, s)) = \mathrm{BDR}(r \cup s \cup \mathrm{RAY}(\mathrm{BDR}(r), \mathrm{BDR}(s))).$$

This is sufficient to show that if $r$, $s$ and $\mathrm{RAY}(\mathrm{BDR}(r), \mathrm{BDR}(s))$
are semi-circular sets, so is $\mathrm{RAY}(r, s)$. To prove that $\mathrm{RAY}(r, s)$
is always semi-circular, whenever $r$ and $s$ are, it therefore suffices to use
a case analysis on $r$ and $s$ and then use the monotonicity of RAY. Several
cases are illustrated in the following figures.

1. r: point; s: line segment without endpoints. RAY(r, s) is the (open) region 

in the left side of Figure 2. 

2. r: point; s: ray without endpoint. See the right side of Figure 2. 



Fig. 2. Cases 1 and 2 



3. r: point; s: circle. If r is inside s, RAY(r, s) is $\mathbb{D}^2$; otherwise,
RAY(r, s) is similar to the left side of Figure 3.

4. s: line segment. See the left side of Figure 4. 

5. For r: circle segment, and s: point. See the right side of Figure 4. 



Fig. 3. Case 3

Fig. 4. Cases 4 and 5

This completes the proof that $\mathrm{RAY}(r, s)$ is semi-circular. We now sketch
the proof for CIRC. Let $r$, $s$ and $t$ be semi-circular sets. We show that
$\mathrm{CIRC}(r, s, t)$ is semi-circular. By monotonicity, we can assume that $r$,
$s$, and $t$ are all connected. For the same reason, we consider the interior
of $r$ and $r \cap \mathrm{BDR}(r)$ separately; in fact we consider each side of the
latter separately. One further assumption that we make is that $\mathrm{BDR}(r)$
is connected. Let

$$\mathcal{R} = \{d(q_s, q_t) \mid q_s \in s,\ q_t \in t\}.$$

Then $\mathrm{CIRC}(r, s, t)$ is the union of all circles with center in $r$ and
radius equal to a number in $\mathcal{R}$, where $\mathcal{R} \subseteq \mathbb{D}^+$.¹
We claim that $\mathcal{R}$ is actually an interval. Assume that
$n, n' \in \mathcal{R}$ and $n < n'' < n'$. Then there are points $p_s, q_s$ in
$s$, and $p_t, q_t$ in $t$, such that $d(p_s, p_t) = n$ and $d(q_s, q_t) = n'$.
Since $s$ and $t$ are connected, there are paths connecting $p_s$ to $q_s$, and
$p_t$ to $q_t$, that are contained in $s$ and $t$, respectively. Since the
distance varies continuously along these paths, it follows that there are
points $p$ and $q$ on these paths with $d(p, q) = n''$.

By monotonicity, the following cases for $\mathcal{R}$ have to be considered:
$[a]$, $(a, a')$, $(a, \infty)$. Most of these are proven by induction from base
cases similar to rays. The main exception is the case $\mathcal{R} = (a, \infty)$.
Here we need the set of points that are of distance $> a$ from some point in
$r$. The idea here is to construct the complement instead, i.e., the set of all
points $p$ that are of distance $\le a$ from every point of $r$.

First observe that if $r$ is unbounded the result must be empty. We then
proceed in two steps: (1) show that circular sides can be replaced by straight
edges, without changing the result, and then (2) show that the result is
semi-circular when $r$ is a bounded semi-linear set.

For the first step there are two cases, depending on whether the arc is
“convex” or “concave.” We show here how the proof works in the case of a concave
arc. Let $r'$ be the result of replacing a concave arc in $r$ by a straight
edge, and let $q_1$ and $q_2$ be the endpoints of this arc. Assume that the arc
is at most half a circle (one can add an extra vertex to assure this). Since
$r'$ contains $r$, it follows that if $p$ is at distance $\le a$ from every
point in $r'$, then it is at distance $\le a$ from every point of $r$. For the
converse, assume that $p$ is of distance $> a$ from some point $q$ in $r' - r$.
If the line from $p$ to $q$, when extended, intersects $r$, then there must be
a point in $r$ of distance $> a$ from $p$. Otherwise, it follows, using standard
geometric techniques, that either $d(p, q_1)$ or $d(p, q_2)$ is greater than or
equal to $d(p, q)$, which is greater than $a$, by assumption.

¹ $\mathbb{D}^+$ is the set of numbers in $\mathbb{D}$ that are greater than or equal to 0.

For (2), let r be semi-linear and bounded. Then it can be shown that a 
point p is at distance ≤ a from every point in r iff it is at distance ≤ a from 
every vertex of r. The latter set is the intersection of the circles with centers 
at the vertices of r and radii a, hence semi-circular. □ 
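
The vertex criterion in step (2) can be checked numerically on a small example. The following is only an illustrative sketch, not part of the proof: the function names, the sample triangle and the point p are assumptions made for this example, and only the boundary of the region is sampled.

```python
import math

def max_vertex_distance(p, vertices):
    """Largest distance from p to any vertex of the polygon."""
    return max(math.dist(p, v) for v in vertices)

def max_boundary_distance(p, vertices, samples=200):
    """Largest distance from p to sampled points on the polygon's edges."""
    best = 0.0
    n = len(vertices)
    for k in range(n):
        (x1, y1), (x2, y2) = vertices[k], vertices[(k + 1) % n]
        for j in range(samples + 1):
            t = j / samples
            point = (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
            best = max(best, math.dist(p, point))
    return best

# Hypothetical data: a triangle and a query point.
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
p = (1.0, 1.0)
a = 3.2

within_vertices = max_vertex_distance(p, triangle) <= a
within_boundary = max_boundary_distance(p, triangle) <= a
```

On this example both tests agree, because the farthest sampled boundary point from p is a vertex.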

5 Discussion and Future Work 

In this paper, we have proposed a language for spatial databases defined 
by line segments and circles, motivated by Euclidean geometry. One natural 
question is how this language relates to that of [KKPV97]. In [KS98], 
we show that this language in fact captures a natural fragment of #circ, and 
that this fragment captures all of FO+lin; it would be interesting to know 
more about the relationship of EuAlg to traditional constraint languages, as 
well as to study its complexity. As the Euclidean query languages can be seen 
as a safe restriction of other constraint languages, it would be of interest to 
see how they relate to the safe languages of [BL98]. 

Another interesting question concerns the choice of an encoding for circles. 
While other representations (by 3 points, or by center and point on circle) may 
also seem reasonable, and are in fact equivalent in the framework 
of [KKPV97], in our approach the representation seems to be critical. If we 
were to define £ciRc(i, j) to be the union of all circles that have a center in 
region i and go through a point in region j, the resulting language would not be 
closed: if column 1 contains a circle, and column 2 a point, column 3 will 
contain a limaçon, which is known not to be semi-circular (see [Due25] for 
the original definition and [Loc61] for a reduction of the construction of this 
curve to trisection of an angle). An alternative definition, using 3 points to 
define a circle, appears on the other hand to be too weak. 

In spite of providing distance functions with the Euclidean construction 
based query languages, developing appropriate query languages for fixpoint 
queries remains an interesting issue. It is unclear how EuAlg can be extended 
to capture fixpoint queries, as was done for FO+lin [GK97] and for 
topological queries [SV98]. 

Finally, while the restriction to Euclidean geometry is motivated by the 
importance of the distance function in many spatial applications, it remains 
a natural question to ask whether the current approach can be adapted to 
more general objects. The most obvious such extension would be to allow 
ellipses as well as circles, and to use as a generalization of the circle-drawing 
primitive the construction of ellipses with radii from a given set of intervals, and 
foci taken from 2 given objects. Unfortunately, this breaks down even when 
we consider a single radius and foci on a single circle: as shown in [KS98], the 
language we get is not closed. It is an open question whether other approaches 
would be more successful. 

Acknowledgments 

The authors wish to thank Jan van den Bussche and Leonid Libkin for their 
comments. Work by Jianwen Su is supported in part by NSF grants IRI- 
9411330, IRI-9700370, and IIS-9817432, and part of his work was done while 
visiting Bell Labs. 



References 

[AB95] S. Abiteboul and C. Beeri. The power of languages for the manipulation 
of complex values. VLDB Journal, 4(4):727-794, October 1995. 

[AHV95] S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison- 
Wesley, 1995. 

[BDLW98] M. Benedikt, G. Dong, L. Libkin, and L. Wong. Relational expressive 
power of constraint query languages. Journal of the ACM, 45:1-34, 1998. 

[BL98] M. Benedikt and L. Libkin. Safe constraint queries. Proc. ACM Symp. on 
PODS, 1998 

[BBC97] A. Belussi, E. Bertino, and B. Catania. Manipulating spatial data in 
constraint databases. In M. J. Egenhofer and J. R. Herring, editors, Intl. Conf. 
on Advances in Spatial Databases (SSD’97), pages 115-141. Springer Verlag, 
LNCS 1262, 1997. 

[Col75] G. E. Collins. Quantifier elimination for real closed fields by cylindric de- 
compositions. In Proc. 2nd GI Conf. Automata Theory and Formal Languages, 
volume 35 of Lecture Notes in Computer Science, pages 134-83. Springer- 
Verlag, 1975. 

[Due25] A. Dürer. Underweysung der Messung. Nürnberg, 1525. 

[GK97] S. Grumbach and G. Kuper. Tractable recursion over geometric data. In 
International Conference on Constraint Programming, 1997. 

[GRS98] S. Grumbach, P. Rigaux, and L. Segoufin. The DEDALE system for 
complex spatial queries. In Proc. ACM SIGMOD, 1998. 

[GS95] S. Grumbach and J. Su. Dense order constraint databases. In Proc. ACM 
PODS, 1995. 

[GS97] S. Grumbach and J. Su. Finitely representable databases. Journal of Com- 
puter and System Sciences, 55(2):273-298, 1997. 

[KKPV97] B. Kuijpers, G. Kuper, J. Paredaens and L. Vandeurzen. First Order 
Languages Expressing Constructible Spatial Database Queries. Journal 
of Computer and System Sciences, to appear. Preliminary version appeared as 
J. Paredaens, B. Kuijpers, G. Kuper and L. Vandeurzen. Euclid, Tarski, and 
Engeler Encompassed. Proceedings of DBPL’97, LNCS 1369. 




A Representation Independent Language for Planar Spatial Databases 



251 



[KKR95] P. Kanellakis, G. Kuper, and P. Revesz. Constraint query languages. 
Journal of Computer and System Sciences, 51(1):26-52, 1995. 

[KRSS98] G. Kuper, S. Ramaswamy, K. Shim, and J. Su. A constraint-based spatial 
extension to SQL. In Proc. of ACM Symp. on CIS, 1998. 

[KS98] G. Kuper and J. Su. Representation Independence and Effective Syntax 
of Euclidean based Constraint Query Languages. Bell Labs Technical Report 
981116-13, 1998. 

[Loc61] E. H. Lockwood. A Book of Curves. Cambridge University Press, 1961. 

[PVV98] J. Paredaens, J. Van den Bussche, and D. Van Gucht. First-order queries 
on finite structures over the reals. SIAM Journal on Computing, 27(6):1747-1763, 
1998. 

[SV98] L. Segoufin and V. Vianu. Querying Spatial Databases via Topological 
Invariants. Proc. ACM Symp. on PODS, 89-98, 1998. 

[Tra50] B. A. Trakhtenbrot. The impossibility of an algorithm for the decision 
problem for finite models. Doklady Akademii Nauk SSSR, 70:569-572, 1950. 

[Vau60] R. L. Vaught. Sentences true in all constructive models. Journal of 
Symbolic Logic, 25(1):39-53, March 1960. 

[VGV98] L. Vandeurzen, M. Gyssens, and D. Van Gucht. An expressive language 
for linear spatial database queries. Proc. ACM Symp. on PODS, 109-118, 1998. 




An Abstract Interpretation Framework for 
Termination Analysis of Active Rules 



James Bailey and Alexandra Poulovassilis 

Dept. of Computer Science, Birkbeck College, University of London, 
Malet Street, London WC1E 7HX. 

{James,ap}@dcs.bbk.ac.uk 



Abstract. A crucial requirement for active databases is the ability to 
analyse the behaviour of the active rules. A particularly important type 
of analysis is termination analysis. We define a framework for modelling 
the execution of active rules, based on abstract interpretation. Specific 
methods for termination analysis are modelled as specific approximations 
within the framework. The correctness of a method can be established 
by proving two generic requirements provided by the framework. This 
affords the opportunity to compare and verify existing methods for ter- 
mination analysis of active rules, and also to develop new ones. 



1 Introduction 

Active databases are capable of reacting automatically to state changes without 
user intervention by supporting active rules of the form “on event if condition do 
action”. An important behavioural property of active rules is that of termination, 
and many methods for analysing termination have been proposed. However, in 
many cases it is not clear whether a method developed for one active database 
system would be correct if applied to another. It may also not be clear whether 
there is a general strategy for proving the correctness of a method and 
understanding the trade-offs made in its design. 

Abstract interpretation has proven a useful tool in the analysis of imperative, 
functional and logic programs [1,10,16]. In this paper we apply it to developing, 
and proving the correctness of, techniques for termination analysis of active 
rules. We develop a general framework for relating real and abstract rule execu- 
tion. Specific termination analysis techniques are modelled in this framework by 
defining specific approximations. The correctness of a technique is established 
by proving two generic requirements provided by the framework. These require- 
ments relate only to individual rules, not to recursive firings of rules. The class of 
active database systems which the framework can model is broad and does not 
assume a particular rule definition language. The formal nature of the framework 
allows a smooth adoption of previous work in query satisfiability, incremental 
evaluation techniques, and approximation techniques. We illustrate the use of 
the framework by developing two abstractions for static termination analysis of 
active rules and a third for dynamic termination analysis. 

Section 2 presents the framework. This involves defining the concrete 
execution semantics and their abstract counterpart. Section 3 describes how the 
framework can be applied to a particular rule language, rule scheduling 
semantics, and approximation. Sections 4, 5 and 6 give three example applications of 
the framework. Section 7 discusses and compares these three techniques. 
Section 8 compares our approach to rule termination analysis with related work. 
We give our conclusions and directions for future research in Section 9. 



2 The Framework 

We use a typed functional metalanguage to specify both the actual execution 
semantics and their abstract counterpart. In this language, the type List(t) 
consists of lists of values of type t, P(t) of sets of values of type t, (t1, ..., tn) is 
the n-product of types t1, ..., tn and t1 -> t2 is the function space from t1 to t2. 
The operator -> is right-associative. Function application is left-associative and 
has higher precedence than any operator. [] denotes the empty list and (x : y) a 
list with head x and tail y. The function map : (a -> b) -> P(a) -> P(b) applies 
its first argument to each element of its second argument. We also require the 
following function which “folds” a binary function f into a list: 

fold : (a->b->a) -> a -> List(b) -> a 

fold f x [] = x 

fold f x (y:ys) = fold f (f x y) ys 
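
For readers more used to imperative notation, the two equations for fold can be transcribed as a loop. This is a minimal Python sketch of the same left fold; the type-variable names and the subtraction example are assumptions chosen only to make the left-to-right order visible.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")
B = TypeVar("B")

def fold(f: Callable[[A, B], A], x: A, ys: Iterable[B]) -> A:
    """Left fold: fold f x [] = x; fold f x (y:ys) = fold f (f x y) ys."""
    acc = x
    for y in ys:
        acc = f(acc, y)  # the accumulator is always the first argument of f
    return acc

# Folding subtraction computes ((10 - 1) - 2) - 3.
result = fold(lambda acc, y: acc - y, 10, [1, 2, 3])
```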

Our specifications are reminiscent of a denotational approach [23] in that we 
represent the database state as a function from a set of identifiers to a semantic 
domain, and define how this state is transformed during rule execution. However, 
our specifications are executable. As we will see, this means that the abstract 
semantics can form the basis for developing practical tests for rule termination. 

2.1 The Execution Semantics 

Each active rule is modelled as a four-tuple consisting of an event query, a 
condition query, a list of actions, and an execution mode. The type Rule is thus: 

Rule = (Query, Query, List(Action), Mode) 

A rule’s execution mode encodes information about where on the current 
schedule the rule’s actions should be placed if the rule fires (e.g. Immediate or 
Deferred scheduling) and to what database state these actions should be bound 
(e.g. the state in which the condition was evaluated, or the state in which the 
action will be executed); see [17] for a description of the scheduling and binding 
possibilities for active rules. We assume that the currently defined rules are held 
in a global list RULES : List(Rule) in order of their priority^. We also assume 
that it is possible to derive a delta query from each rule action which encodes 
the change that the action would make to the database state. 

^ Thus rules are totally ordered and rule firing is deterministic. Handling 
non-deterministic firings of rules of the same priority is an area of future work. 

Schedules are lists of actions, and database states are modelled as functions 
from a set of database object identifiers, Id, to a semantic domain, D: 

Schedule = List(Action) 

DBState = Id -> D 

We assume that there is a distinguished value ∅ ∈ D. A rule fires if its event 
query and condition query both evaluate to a value other than ∅. 

The Id, D, Query, Action and Mode types referred to above may clearly be 
different for different active database systems. In Section 3 we define them for 
a relational database system, but they can be defined for other types of system, 
such as object-oriented ones. 

There are three kinds of database object identifiers: the set of data identifiers, 
DataId, the set of view identifiers, ViewId, and the set of binding identifiers, 
BindId (so that Id = DataId ∪ ViewId ∪ BindId). 

Data identifiers: These are objects over which users can specify queries 
and actions e.g. the names of base relations and the names of delta relations in 
a relational database system. 

View identifiers: Suppose that RULES contains n rules, so that there 
are n event queries eq1, ..., eqn, n condition queries cq1, ..., cqn and m ≥ n delta 
queries dq1, ..., dqm. Then, the domain of the database state will contain a set of 
corresponding view identifiers, e1, ..., en, c1, ..., cn, d1, ..., dm, which are mapped 
to the current values of their corresponding queries. 

Binding identifiers: These record a history of the values that the view 
identifiers take throughout the rule execution. As we will see below, this history 
is needed in order to support a variety of rule binding modes. 

We specify the rule execution semantics by a function execSched, listed be- 
low. This takes a database state and a schedule, and repeatedly executes the 
first action on the schedule, updating the schedule with the actions of rules that 
fire along the way. If execSched terminates, it outputs the final database state 
and the final, empty, schedule. If it fails to terminate, it produces no output. 

The function exec (see below) executes the first action, a, on the schedule and 
returns a new database state. This new state, db' say, is then passed to a function 
createSnapshot : DBState -> DBState whose definition is straightforward and 
not listed here. createSnapshot extends the domain of db' with a new binding 
identifier for each view identifier i ∈ dom(db'), setting the value of this new 
binding identifier to be db'(i). These binding identifiers thus create a “snapshot” 
of the current values of the view objects. The event, condition, and delta queries 
of rules that fire as a result of the execution of the action a can be bound to this 
snapshot by updateSched (see below). This snapshot is never updated^. 

^ Clearly in a practical implementation the entire view state does not need to be 
copied each time, only the changes to it. Moreover, snapshots that are no longer 
referenced on the schedule can be discarded from the database state. 

The function schedRules applies the function schedRule to each rule, 
in order of the rules’ priority. schedRule determines whether a given rule 
(eq,cq,as,m) should fire. If the rule’s event query eq or its condition query 
cq evaluates to ∅ in the current database state, then the rule does not fire and the 
database state and schedule remain unchanged. Otherwise, updateSched (see 
below) is called to update the database state and schedule. 

execSched : (DBState,Schedule) -> (DBState,Schedule) 
execSched (db,s) = 
    if s = [] 
    then (db,s) 
    else execSched (schedRules (execUpdate (db,s))) 

execUpdate : (DBState,Schedule) -> (DBState,Schedule) 
execUpdate (db,a:s) = (createSnapshot (exec (a,db)),s) 

schedRules : (DBState,Schedule) -> (DBState,Schedule) 
schedRules (db,s) = fold schedRule (db,s) RULES 

schedRule : Rule -> (DBState,Schedule) -> (DBState,Schedule) 
schedRule (eq,cq,as,m) (db,s) = 
    if (empty eq db) or (empty cq db) 
    then (db,s) 
    else updateSched (as,m,db,s) 
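
The control flow of execSched can be sketched in Python as a loop that takes empty, exec and updateSched as parameters. This is an illustration under several assumptions rather than the paper's definition: createSnapshot is omitted, queries are plain relation names, only insert actions are modelled, the "+R" key convention for delta relations is invented for the example, and a max_steps budget is added so that a non-terminating rule set makes the function return None instead of looping forever.

```python
def exec_sched(db, sched, rules, empty, exec_action, update_sched, max_steps=1000):
    """Run the schedule to completion, or return None if the step budget runs out."""
    for _ in range(max_steps):
        if not sched:
            return db, sched
        action, rest = sched[0], sched[1:]
        db, sched = exec_action(action, db), rest      # execUpdate (snapshot omitted)
        for eq, cq, actions, mode in rules:            # schedRules, in priority order
            if not (empty(eq, db) or empty(cq, db)):   # schedRule
                db, sched = update_sched(actions, mode, db, sched)
    return (db, sched) if not sched else None

# --- A toy instantiation (all of it hypothetical) ---
def empty(q, db):
    return not db.get(q, set())

def exec_action(action, db):
    _kind, r, vals = action                # only "insert" is modelled here
    new = dict(db)
    for key in db:
        if key.startswith("+"):
            new[key] = set()               # reset every delta relation
    new["+" + r] = set(vals) - db[r]       # tuples that are genuinely new
    new[r] = db[r] | set(vals)
    return new

def update_sched(actions, mode, db, sched):
    return db, (actions + sched if mode == "immediate" else sched + actions)

db0 = {"R": set(), "S": set(), "T": {1}, "+R": set(), "+S": set()}
rules = [
    ("+R", "T", [("insert", "S", {1})], "immediate"),  # on +R if T do insert S
    ("+S", "T", [("insert", "R", {1})], "immediate"),  # on +S if T do insert R
]
outcome = exec_sched(db0, [("insert", "R", {1})], rules, empty, exec_action, update_sched)
```

In this toy run the two rules trigger each other once and then stop, because the second insertion into R adds nothing new and so leaves +R empty.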

Three functions therefore remain to be defined for any specific rule language 
and execution semantics: empty, exec and updateSched. 

empty : Query -> DBState -> Bool determines whether a query evaluates 
to ∅ with respect to the current database state. 

exec : (Action,DBState) -> DBState takes an action, a, and a database 
state, db, and updates the values of the data objects and view objects in db to 
reflect the effect of a. We assume that if a rule’s event query or condition query 
evaluates to ∅ then, were they to be scheduled, the rule’s actions would have no 
effect, i.e. that exec (a,db) = db for any such action a and any database state 
db. We call such actions null actions. Null actions are not actually scheduled by 
execSched but do need to be considered by the abstract execution semantics in 
order to reflect all possible concrete executions (see outline proof of Theorem 1 in 
Section 2.3). An easy way to guarantee that all rule actions satisfy this property 
is to encode the rule’s event and condition queries within each of the rule’s 
actions (note that this has no effect on the semantics of a rule). Thus, each rule 
action has the notional form if eq ∧ cq then update(dq) where eq is the event 
query, cq the condition query, dq the delta query, and update is some expression. 
The precise syntax of rule actions will vary from system to system. 

updateSched : (List(Action),Mode,DBState,Schedule) -> (DBState,Schedule) 
takes a rule’s list of actions as, its execution mode m, the current 
database state db, and the current schedule s, and does the following: 

(i) Replaces the event and condition query encoded within each action a ∈ as 
by its corresponding snapshot view identifier. This binds the encoded event 
and condition queries to the current database state. 

(ii) If the execution mode m states that the rule’s action must also be bound 
to the current state, then the delta query appearing within each a ∈ as is 
also replaced by its corresponding snapshot view identifier. 

(iii) The resulting reaction transaction is inserted into the appropriate part of 
the schedule, as indicated by m. 

We assume that this processing is independent of the values that database 
identifiers are mapped to, so that updateSched in fact has a more general 
signature (List(Action),Mode,Id->a,Schedule) -> (Id->a,Schedule) where a is 
a type variable. In other words, updateSched is polymorphic over the semantic 
domain D. This means that updateSched can also be used in the abstract 
execution semantics. We assume that createSnapshot is similarly polymorphic 
over D, and it too is used in the abstract execution semantics. 

2.2 The Abstract Execution Semantics 

We are now ready to define the abstract counterpart, execSched*, to execSched. 
The definition of execSched* is listed below. We distinguish abstract types and 
functions by suffixing their names with a *. The abstract database state type 
is defined by DBState* = Id -> D*, where D* is the abstract counterpart 
to the semantic domain D. In general, D* will be different for each particular 
abstraction. There needs to be a distinguished constant ∅* ∈ D* which is the 
abstract counterpart to ∅ ∈ D. Rules and schedules are syntactic objects which 
are common to both the concrete and the abstract semantics. 

We see that execSched* is identical to execSched except that it operates on 
an abstract database state type and that at the “leaves” of the computation 
the functions empty and exec are replaced by abstract counterparts empty* and 
exec*. empty* : Query -> DBState* -> Bool determines whether a query evaluates 
to ∅* with respect to an abstract database state. exec* : (Action,DBState*) 
-> DBState* executes an action on the data objects and view objects in an 
abstract database state. As we discuss further in Section 3, these two functions 
need to be defined for each specific abstraction. 

execSched* : (DBState*,Schedule) -> (DBState*,Schedule) 
execSched* (db*,s) = 
    if s = [] 
    then (db*,s) 
    else execSched* (schedRules* (execUpdate* (db*,s))) 

execUpdate* : (DBState*,Schedule) -> (DBState*,Schedule) 
execUpdate* (db*,a:s) = (createSnapshot (exec* (a,db*)),s) 

schedRules* : (DBState*,Schedule) -> (DBState*,Schedule) 
schedRules* (db*,s) = fold schedRule* (db*,s) RULES 

schedRule* : Rule -> (DBState*,Schedule) -> (DBState*,Schedule) 
schedRule* (eq,cq,as,m) (db*,s) = 
    if (empty* eq db*) or (empty* cq db*) 
    then (db*,s) 
    else updateSched (as,m,db*,s) 

2.3 Correctness of the Abstract Semantics 

An abstract database approximates a number of real databases. These possi- 
ble concretisations are obtained by applying a concretisation function (see Sec- 
tion 2.4) to the abstract database. 

Termination is an undecidable property even for simple active rule 
languages [-5]. execSched* is a safe approximation if the relation below holds 
for all db* ∈ DBState* and s ∈ Schedule, where conc is the chosen 
concretisation function and ⊑ the information ordering on P(DBState, Schedule) 
(see Section 2.5): 

conc (execSched* (db*, s)) ⊑ map execSched (conc (db*, s)) 

The LHS of this relation corresponds to the set of possible concrete 
databases and schedules obtained by first running the abstract execution on 
(db*, s) and then deriving all possible concretisations of the resulting abstract 
database and schedule (note that if execSched* terminates then all the 
concretised final schedules will be empty). The RHS corresponds to first deriving all 
possible concretisations of (db*, s) and then applying the real execution to each 
concrete database and schedule pair. 

Thus a safe approximation yields no more information than the real execution 
would. The following theorem states sufficient conditions for this property to 
hold: 

Theorem 1. execSched* is a safe approximation if 

(i) for all actions a and all db* ∈ DBState*, 

conc (exec* (a, db*)) ⊑ map exec (conc (a, db*)), and 

(ii) for all event or condition queries q and all db* ∈ DBState*, 
empty* q db* = True ⇒ (∀db ∈ conc db* . empty q db = True) 

Proof (outline). Since we are using a functional metalanguage, fixpoint 
induction [23] can be employed. The values of execSched* and execSched are given 
by ⊔{F^i(⊥) | i ≥ 0} and ⊔{G^i(⊥) | i ≥ 0} respectively, where 

F = λf.λ(db*, s). if (s = []) then (db*, s) else f (schedRules* (execUpdate* (db*, s))) 
G = λg.λ(db, s). if (s = []) then (db, s) else g (schedRules (execUpdate (db, s))) 

It is thus sufficient to show that conc ∘ F^i(⊥) ⊑ (map G^i(⊥)) ∘ conc for all 
i ≥ 0. The base case of i = 0 is straightforward. For the inductive case, we can 
show by provisos (i) and (ii) of the theorem that 

conc ∘ schedRules* ∘ execUpdate* ⊑ (map (schedRules ∘ execUpdate)) ∘ conc 

It is here that null actions may arise since the concretisation of the abstract 
database and schedule on the LHS may generate database, schedule pairs in 
which the schedule contains null actions whereas the corresponding database, 
schedule pair in the RHS will not include such actions. Using the above 
relation and the induction hypothesis, the inductive case of the theorem follows. 
□ 



2.4 Defining the conc Functions 

A concretisation function is needed for the argument type of each of the 
functions called by execSched*. We observe from the definition of execSched* 
that there are four different argument types: DBState*, (DBState*, Schedule), 
(Action, DBState*), and (List(Action), Mode, DBState*, Schedule). We use 
the concretisation function over DBState* to define the concretisation 
functions over the other three argument types. These definitions are 
independent of the particular abstraction adopted. We give them here and they 
can be reused for each specific abstraction considered later in the paper. 
In particular, given a definition for conc : DBState* -> P(DBState), the 
concretisation functions over (DBState*, Schedule), (Action, DBState*) and 
(List(Action), Mode, DBState*, Schedule) are respectively defined as follows 
(we overload the identifier conc since its type can always be inferred from 
context): 

conc (db*, s) = {(db, s) | db ← conc db*} 

conc (a, db*) = {(a, db) | db ← conc db*} 

conc (as, m, db*, s) = {(as, m, db, s) | db ← conc db*} 
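
The three lifted definitions only thread the extra components through the concretisation of the database state. The Python sketch below makes this explicit; the toy conc_db, in which each abstract value is simply a tuple of candidate concrete values, is an assumption introduced for the example and is not one of the abstractions defined in this paper.

```python
from itertools import product

def conc_db(db_abs):
    """Toy concretisation: pick one candidate concrete value per identifier."""
    ids = sorted(db_abs)
    return [dict(zip(ids, choice)) for choice in product(*(db_abs[i] for i in ids))]

def conc_sched(db_abs, s):                 # conc (db*, s)
    return [(db, s) for db in conc_db(db_abs)]

def conc_action(a, db_abs):                # conc (a, db*)
    return [(a, db) for db in conc_db(db_abs)]

def conc_update(actions, m, db_abs, s):    # conc (as, m, db*, s)
    return [(actions, m, db, s) for db in conc_db(db_abs)]

# Hypothetical abstract state: R is either empty or {1}; S is known to be {2}.
db_abs = {"R": (frozenset(), frozenset({1})), "S": (frozenset({2}),)}
pairs = conc_sched(db_abs, ["a1", "a2"])
```

Lists stand in for the sets of the metalanguage here, since Python dictionaries are not hashable.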

2.5 Defining ⊑ 

The concrete domain D, together with the information ordering on it, ⊑_D, need 
to be defined for a given active database system. For example, in Section 3 we 
define D for a relational database system. The domain DBState is derived from D. 
The orderings on DBState, Schedule, P(DBState) and P(DBState, Schedule) 
are derived in a standard fashion, independent of the particular choice of D. 
We define these orderings here, and they can be reused for all the abstractions 
considered later in the paper. 

The ordering on DBState, ⊑_DBState, is the usual ordering on a function 
space and is defined in terms of ⊑_D as follows, where db1, db2 ∈ DBState: 

db1 ⊑_DBState db2 if ∀i ∈ Id. db1 i ⊑_D db2 i 

The ordering on P(DBState) is as follows for DB1, DB2 ∈ P(DBState): 

DB1 ⊑_P(DBState) DB2 if ∀db2 ∈ DB2 . ∃db1 ∈ DB1 . db1 ⊑_DBState db2 



Note that this is the Smyth powerdomain ordering [23]. Since the sets of database 
states that we consider in our abstract interpretation framework are generated 
by concretising a particular abstract database state, this reflects the intuition 
that more precise abstractions give rise to smaller sets of database states. 

To define the ordering on Schedule (= List(Action)) we require an operator 
reduce : Schedule -> Schedule which removes all null actions from a schedule. 
Then ⊑_Schedule is as follows for s1, s2 ∈ Schedule: 

s1 ⊑_Schedule s2 if (reduce(s1) = reduce(s2)) ∧ (length(s1) ≥ length(s2)) 

Thus s1, s2 are identical apart from any null actions they may contain, except 
that s2 is no longer, and hence has no worse termination properties, than s1. 

Finally, the ordering on P(DBState, Schedule) is as follows for DBS1, DBS2 
∈ P(DBState, Schedule): DBS1 ⊑_P(DBState,Schedule) DBS2 if 

∀(db2, s2) ∈ DBS2 . ∃(db1, s1) ∈ DBS1 . db1 ⊑_DBState db2 ∧ s1 ⊑_Schedule s2 

Generally, we do not subscript the ⊑ symbol since which of the above 
orderings is meant can be inferred from context. 
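
The orderings of this section translate directly into executable predicates. The following Python sketch is illustrative only: the ordering on D is passed in as a parameter because it depends on the database system, and the flat (equality) ordering and the "null" marker used in the example are assumptions.

```python
def leq_dbstate(db1, db2, leq_d):
    """Pointwise ordering on database states (dicts from identifiers to values)."""
    return db1.keys() == db2.keys() and all(leq_d(db1[i], db2[i]) for i in db1)

def leq_smyth(xs1, xs2, leq):
    """Smyth ordering: every element of xs2 is above some element of xs1."""
    return all(any(leq(x1, x2) for x1 in xs1) for x2 in xs2)

def reduce_sched(s, is_null):
    """Remove all null actions from a schedule."""
    return [a for a in s if not is_null(a)]

def leq_schedule(s1, s2, is_null):
    """s1 and s2 agree up to null actions, and s2 is no longer than s1."""
    return reduce_sched(s1, is_null) == reduce_sched(s2, is_null) and len(s1) >= len(s2)

# Hypothetical example data.
is_null = lambda a: a == "null"
flat = lambda x, y: x == y                      # assumed flat ordering on D
s_long, s_short = ["a", "null", "b"], ["a", "b"]
db_x, db_y = {"R": 1}, {"R": 2}
leq_db = lambda d1, d2: leq_dbstate(d1, d2, flat)

schedule_ok = leq_schedule(s_long, s_short, is_null)
smyth_ok = leq_smyth([db_x, db_y], [db_x], leq_db)
```

With a flat ordering on D, the Smyth test reduces to set containment in the reverse direction: the larger set of states lies below the smaller one, which matches the intuition that more precise abstractions give smaller sets.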

3 Applying the Framework 

To summarise, in order to use the framework just described, one needs to define: 

(i) the concrete domain D, 

(ii) the empty and exec functions, 

(iii) the updateSched function, 

(iv) the abstract domain D*, 

(v) the concretisation function conc : DBState* -> P(DBState), and 

(vi) the empty* and exec* functions. 

Parts (i) and (ii) are specific to the particular query/update language, part (iii) 
to the particular rule scheduling semantics and parts (iv)-(vi) to the particular 
abstraction. Once an abstraction has been defined, one needs to show that the 
two requirements for Theorem 1 hold. 

For the remainder of the paper we assume a relational database system, with 
the relational algebra as the query/update language. Referring to point (i) of 
the framework, in this case the concrete domain is 

D = P(Const) + P(Const, Const) + P(Const, Const, Const) + ... 

where Const is a possibly infinite set of constants and + is the disjoint union 
domain constructor [23]. Thus, each member of D is a set of n-tuples, for some 
n > 0. The distinguished constant ∅ is the empty set, {}. 

Referring to point (ii) of the framework, in order to define exec and empty 
we first need some auxiliary sets. Let Expr be the set of relational algebra 
expressions over D. Let BaseRel be the set of base relation names and DeltaRel 
the set of delta relation names, so that DataId = BaseRel ∪ DeltaRel. The 
syntax of e ∈ Expr, q ∈ Query, a ∈ Action and dr ∈ DeltaRel is then as follows, 
where d ∈ D, i ∈ Id and r ∈ BaseRel: 

e ::= d | e1 ∪ e2 | e1 × e2 | e1 − e2 | e1 ∩ e2 | σ_θ e | π_A e 
q ::= i | q1 ∪ q2 | q1 × q2 | q1 − q2 | q1 ∩ q2 | σ_θ q | π_A q 
a ::= insert r q | delete r q 
dr ::= +Δr | −Δr 

For ease of exposition, we assume that event queries have the form +ΔR or −ΔR, 
although in principle they could be arbitrarily complex. An example active rule 
is thus: on +ΔR1 if R2 ∩ R3 do delete R2 +ΔR1. 

With respect to point (iii) of the framework, we do not define a specific rule 
scheduling semantics for our archetypal relational database system as this is 
in fact immaterial to the termination analysis techniques that we will explore 
(hence, we have omitted the mode part from the above example active rule). For 
each abstraction that we consider in subsequent sections of the paper, it thus 
only remains to address points (iv)-(vi) of the framework. 

We recall from Section 2.1 our requirement that if a rule’s event query or 
condition query evaluates to ∅ then, were they to be scheduled, the rule’s actions 
would have no effect. For any user-specified rule action of the form insert r q, 
the corresponding delta query is q − r and the rule action can be rewritten 
to insert r π_atts(r)(eq × cq × (q − r)) where eq and cq are the rule’s event 
query and condition query, respectively. This has the effect of encoding the rule’s 
event and condition queries within the action, without changing the semantics 
of the action. Similarly, for any user-specified rule action of the form delete r q, 
the corresponding delta query is q ∩ r and the rule action can be rewritten to 
delete r π_atts(r)(eq × cq × (q ∩ r)). 
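
The effect of this rewriting can be seen on small relations: the Cartesian product with eq and cq is empty as soon as either of them is empty, and otherwise the projection recovers exactly the delta query's value. The Python sketch below assumes relations are sets of tuples and that the attributes of r are the last columns of the product; the helper names and data are hypothetical.

```python
def product(r, s):
    """Cartesian product of two relations given as sets of tuples."""
    return {t + u for t in r for u in s}

def project_last(rel, k):
    """Project each tuple onto its last k columns."""
    return {t[len(t) - k:] for t in rel}

def guarded_delta(eq, cq, delta, arity):
    """pi_atts(r)(eq x cq x delta), with atts(r) being the last `arity` columns."""
    return project_last(product(product(eq, cq), delta), arity)

# Hypothetical values: r and q are unary relations.
r = {(1,)}
q = {(1,), (2,)}
eq = {(2,)}            # a non-empty event query result
cq = {("x", "y")}      # a non-empty condition query result

fires = guarded_delta(eq, cq, q - r, 1)        # rule would fire
no_event = guarded_delta(set(), cq, q - r, 1)  # event query empty
no_cond = guarded_delta(eq, set(), q - r, 1)   # condition query empty
```

When the rule would fire the guarded query equals q − r, and when either guard is empty the inserted set is empty, so the action is a null action.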

In order to define exec and empty we assume two auxiliary functions, eval 
and bind. The function eval : Expr -> D is a relational algebra evaluator, i.e. it 
evaluates an expression to the set of tuples in D that it denotes. The function 
bind : Query -> DBState -> Expr substitutes occurrences of identifiers i ∈ Id 
in a query by their values in the given database state and returns the resulting 
expression. empty and exec are then defined as follows: 



empty q db = (eval (bind q db) = {}) 

exec (insert r q, db) = let db1 = db[r ↦ eval (bind (r ∪ q) db), 
                                    +Δr ↦ eval (bind (q − r) db), −Δr ↦ {}, 
                                    ∀s ∈ BaseRel, s ≠ r.(+Δs ↦ {}, −Δs ↦ {})] 
                        in db1[∀i ∈ ViewId.(i ↦ eval (bind i_q db1))] 

exec (delete r q, db) = let db1 = db[r ↦ eval (bind (r − q) db), 
                                    +Δr ↦ {}, −Δr ↦ eval (bind (q ∩ r) db), 
                                    ∀s ∈ BaseRel, s ≠ r.(+Δs ↦ {}, −Δs ↦ {})] 
                        in db1[∀i ∈ ViewId.(i ↦ eval (bind i_q db1))] 

^ We assume that the queries q in the arguments to exec have been rewritten to encode 
the event and condition queries as described earlier. We use maplets, a ↦ b, to denote 
members of functions. Given a function f, f[a1 ↦ v1, ..., an ↦ vn] is short-hand 
for the function (f − {a1 ↦ f a1, ..., an ↦ f an}) ∪ {a1 ↦ v1, ..., an ↦ vn}. 

Thus, empty q db binds the query q to the database state db and then calls eval 
to evaluate the resulting expression. exec (insert r q, db) replaces the current 
value of r in db by the result of evaluating r ∪ q w.r.t. db, updates the values of 
all the delta relations, and updates the values of all the view identifiers, where i_q 
is the query corresponding to a view identifier i. exec (delete r q, db) is similar. 
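
A small executable model may help in reading these definitions. The Python sketch below is a simplification under stated assumptions: queries are identifier strings or binary tuples such as ("union", q1, q2), selection and projection are left out, delta relations are stored under the invented keys "+R" and "-R", and views is a hypothetical dictionary from view identifiers to their queries.

```python
def bind(q, db):
    """Substitute each identifier in q by its value in db."""
    if isinstance(q, str):
        return frozenset(db[q])
    op, left, right = q
    return (op, bind(left, db), bind(right, db))

def eval_expr(e):
    """Evaluate a bound expression to a set of tuples."""
    if isinstance(e, frozenset):
        return set(e)
    op, left, right = e
    a, b = eval_expr(left), eval_expr(right)
    if op == "union":
        return a | b
    if op == "diff":
        return a - b
    if op == "inter":
        return a & b
    if op == "prod":
        return {t + u for t in a for u in b}
    raise ValueError(op)

def empty(q, db):
    return eval_expr(bind(q, db)) == set()

def exec_action(action, db, base_rels, views):
    kind, r, q = action
    db1 = dict(db)
    for s in base_rels:                        # reset all delta relations
        db1["+" + s], db1["-" + s] = set(), set()
    if kind == "insert":
        db1[r] = eval_expr(bind(("union", r, q), db))
        db1["+" + r] = eval_expr(bind(("diff", q, r), db))
    else:                                      # delete
        db1[r] = eval_expr(bind(("diff", r, q), db))
        db1["-" + r] = eval_expr(bind(("inter", q, r), db))
    result = dict(db1)
    for view_id, view_query in views.items():  # re-evaluate views against db1
        result[view_id] = eval_expr(bind(view_query, db1))
    return result

base = ["R1", "R2", "R3"]
views = {"c1": ("inter", "R2", "R3")}
db0 = {"R1": {(1,)}, "R2": {(1,), (2,)}, "R3": {(2,)}, "c1": set()}
for s in base:
    db0["+" + s], db0["-" + s] = set(), set()

db_after_insert = exec_action(("insert", "R1", "R2"), db0, base, views)
db_after_delete = exec_action(("delete", "R2", "+R1"), db_after_insert, base, views)
```

The second call mirrors the action of the example rule above (delete R2 +ΔR1): it removes from R2 the tuple just inserted into R1, after which the view c1 for R2 ∩ R3 is empty.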



4 Abstraction 1: An Abstraction for Static Termination 
Analysis 

In this section we define an abstraction for statically inferring the termination 
of a set of rules and show that it is a safe approximation. We first address 
points (iv)-(vi) of the framework. The abstract domain D* consists of the set of 
relational algebra queries q ∈ Query as defined in the previous section, together 
with the value ∅* = {}. exec* replaces the abstract value of each data identifier by 
a new abstract value, obtained by binding to the current abstract database state 
the query that would have been evaluated by exec in the concrete semantics: 



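$$
\begin{aligned}
exec^*\ (insert\ r\ q,\ db^*) = {} & \mathbf{let}\ db_1^* = db^*[\, r \mapsto bind\ (r \cup q)\ db^*, \\
& \qquad +\Delta r \mapsto bind\ (q - r)\ db^*,\ -\Delta r \mapsto \{\}, \\
& \qquad \forall s \in BaseRel,\ s \neq r.\ (+\Delta s \mapsto \{\},\ -\Delta s \mapsto \{\}) \,] \\
& \mathbf{in}\ db_1^*[\, \forall i \in ViewId.\ (i \mapsto bind\ i_q\ db_1^*) \,] \\[4pt]
exec^*\ (delete\ r\ q,\ db^*) = {} & \mathbf{let}\ db_1^* = db^*[\, r \mapsto bind\ (r - q)\ db^*, \\
& \qquad +\Delta r \mapsto \{\},\ -\Delta r \mapsto bind\ (q \cap r)\ db^*, \\
& \qquad \forall s \in BaseRel,\ s \neq r.\ (+\Delta s \mapsto \{\},\ -\Delta s \mapsto \{\}) \,] \\
& \mathbf{in}\ db_1^*[\, \forall i \in ViewId.\ (i \mapsto bind\ i_q\ db_1^*) \,]
\end{aligned}
$$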
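
The difference from the concrete semantics is that exec* never evaluates anything: it only builds larger expressions. A minimal Python sketch of this idea follows. The encoding is an assumption made for illustration only: expressions are nested tuples such as ("union", a, b), identifiers and initial-value constants are strings, and delta relations use the hypothetical keys "+d_" + s and "-d_" + s.

```python
EMPTY = ("empty",)  # stands for the abstract value {}


def bind(expr, state):
    """Replace every identifier in `expr` by the expression `state` holds for it."""
    if isinstance(expr, str):
        return state.get(expr, expr)  # constants such as "R0" map to themselves
    op, *args = expr
    return (op, *(bind(arg, state) for arg in args))


def exec_star(kind, r, q, state, base_rels, views):
    """Symbolic counterpart of exec: new values are bound queries, not relations."""
    db1 = dict(state)
    for s in base_rels:
        db1["+d_" + s] = EMPTY
        db1["-d_" + s] = EMPTY
    if kind == "insert":
        db1[r] = bind(("union", r, q), state)
        db1["+d_" + r] = bind(("diff", q, r), state)
    else:  # delete
        db1[r] = bind(("diff", r, q), state)
        db1["-d_" + r] = bind(("inter", q, r), state)
    result = dict(db1)
    for i, i_q in views.items():
        result[i] = bind(i_q, db1)
    return result


state0 = {"R": "R0", "S": "S0", "T": "T0", "+d_R": "I", "-d_R": EMPTY}
views = {"cond1": ("product", ("inter", "R", "S"), ("inter", "R", "T"))}
state1 = exec_star("delete", "R", "T", state0, ["R", "S", "T"], views)
state2 = exec_star("insert", "R", ("diff", "S", "T"), state1, ["R", "S", "T"], views)
```

Here the two calls apply a deletion of T from R followed by an insertion of S - T into R, so the value recorded for R after the second call is the expression (R0 - T0) ∪ (S0 - T0).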

The concretisation function $conc : DBState^* \to \mathbb{P}(DBState)$, when applied to an
abstract database state db*, returns a set of concrete database states db; each db
is obtained by first choosing a function $\rho_{db} : Id \to D$ which defines an initial
value for each database identifier, and then setting $db\ i = eval\ (bind\ (db^*\ i)\ \rho_{db})$
for all $i \in Id$. The remaining conc functions are derived as in Section 2.4.

Defining empty* amounts to selecting a method for analysing the satisfiabil- 
ity of queries. Satisfiability is undecidable for the general relational algebra, so 
we must ensure that the queries being tested are expressed in some decidable 
fragment. One decidable fragment consists of queries where the argument to 
projection operations does not contain any occurrence of set-difference [22]. To 
guarantee that the queries passed to empty* have this form, we use a function 
normalise that safely rewrites a query into an approximating interval of two
queries. This function applies bottom-up the following rules, except that the rule
for $e_1 - e_2$ becomes $[\{\}, b]$ for any sub-expression $e_1 - e_2$ occurring within the
scope of a projection operation. In these rules, a, b, c, d are relational algebra
expressions, $a \subseteq b$, $c \subseteq d$, $e_1 = [a, b]$ and $e_2 = [c, d]$:

$$
\begin{aligned}
e_1 \cup e_2 &= [a \cup c,\ b \cup d] & e_1 - e_2 &= [a - d,\ b - c] \\
e_1 \cap e_2 &= [a \cap c,\ b \cap d] & \sigma_\theta(e_1) &= [\sigma_\theta(a),\ \sigma_\theta(b)] \\
\pi_A(e_1) &= [\pi_A(a),\ \pi_A(b)] & e_1 \times e_2 &= [a \times c,\ b \times d] \\
[[a, b], [c, d]] &= [a, d]
\end{aligned}
$$

e.g. $normalise(\pi_X((R_1 \times R_2) - R_3) \times (R_2 - R_1)) = [\emptyset,\ \pi_X(R_1 \times R_2) \times (R_2 - R_1)]$.
The definition of empty* is thus:

$$
empty^*\ q\ db^* = satisfiable\ (normalise\ (bind\ q\ db^*))
$$

where satisfiable applies the satisfiability test of [22] to the upper query of the
interval output by normalise.
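
The interval rules can be sketched as a small recursive function. The Python below is an illustration only, assuming a hypothetical tuple encoding of relational algebra: ("union", l, r), ("inter", l, r), ("diff", l, r), ("product", l, r), ("select", theta, e) and ("project", attrs, e), with base relations as strings. Because the function returns a (lower, upper) pair directly, nested intervals never arise and the flattening rule is implicit. Unlike the example in the text, the lower bound is not simplified to the empty query.

```python
EMPTY = ("empty",)


def normalise(expr, in_projection=False):
    """Return a pair (lower, upper) of queries that bound `expr`."""
    if isinstance(expr, str) or expr == EMPTY:
        return expr, expr  # a base relation bounds itself
    op = expr[0]
    if op in ("select", "project"):
        _, param, sub = expr
        low, high = normalise(sub, in_projection or op == "project")
        return (op, param, low), (op, param, high)
    _, left, right = expr
    a, b = normalise(left, in_projection)
    c, d = normalise(right, in_projection)
    if op == "diff":
        if in_projection:
            return EMPTY, b  # drop the set-difference under a projection
        return ("diff", a, d), ("diff", b, c)
    return (op, a, c), (op, b, d)  # union, inter and product are monotone


query = ("product",
         ("project", "X", ("diff", ("product", "R1", "R2"), "R3")),
         ("diff", "R2", "R1"))
lower, upper = normalise(query)
```

The upper bound computed for this query contains no set-difference inside the projection, which is the property required of queries handed to the satisfiability test.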

The following theorem is then straightforwardly shown:

Theorem 2. execSched* is a safe approximation of execSched, given the above
definitions of D*, conc, exec* and empty*.

Example 1. Consider the following two rules:

Rule 1: on $+\Delta R$ if $(R \cap S) \times (R \cap T)$ do delete R T
Rule 2: on $-\Delta R$ if true do insert R $(S - T)$

If rule 1 is the first rule triggered, then a trace of the abstract execution
on each successive call to execSched* is as follows, where ‘Iter’ denotes the
current iteration, action(i) the action of rule i and I is an arbitrary unused
identifier denoting that $+\Delta R$ is satisfiable at the first iteration:



Iter   Database State                                                                                          Schedule
       R                                 S       T       $+\Delta R$                      $-\Delta R$
1      $R_0$                             $S_0$   $T_0$   $I$                              {}               [action(1)]
2      $R_0 - T_0$                       $S_0$   $T_0$   {}                               $R_0 \cap T_0$   [action(2)]
3      $(R_0 - T_0) \cup (S_0 - T_0)$    $S_0$   $T_0$   $(S_0 - T_0) - (R_0 - T_0)$      {}               [action(1)]



Note that in this example the execution mode of the rules is immaterial since
at most one rule is ever triggered. At iteration 3, the condition of rule 1 expands
to the expression $(((R_0 - T_0) \cup (S_0 - T_0)) \cap S_0) \times (((R_0 - T_0) \cup (S_0 - T_0)) \cap T_0)$,
which is unsatisfiable. Hence execSched* terminates after three iterations and
definite rule termination can be concluded if rule 1 is the first rule triggered. □

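
To see why the expanded condition of rule 1 is unsatisfiable at iteration 3, it is enough to look at its second factor. A short derivation using only standard set identities:

$$
\begin{aligned}
((R_0 - T_0) \cup (S_0 - T_0)) \cap T_0 &= ((R_0 \cup S_0) - T_0) \cap T_0 \\
&= \emptyset
\end{aligned}
$$

Since one factor of the Cartesian product is empty for every choice of $R_0$, $S_0$ and $T_0$, the whole product is empty.
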
Of course in general there is no guarantee that execSched* itself will termi- 
nate and so we need a criterion for halting it (concluding in such a case that 
the concrete rule execution may fail to terminate). A simple way is to choose a 
priori a bound on the number of iterations of execSched* . Increasing this bound 
allows more precision but must be balanced against the time and space resources 
available. Another strategy for halting execSched* is to choose a more limited 
abstract domain D* that makes DBState* finite. execSched* can then be halted 
if a repeating argument (db* , s) is detected since in such a case execSched* would 
not terminate. We demonstrate such an abstraction in the next section.

5 Abstraction 2: A Coarser Abstraction for Static
Termination Analysis

We now set D* to be the two-valued domain {Unknown, False}, where $\emptyset^*$ =
False. conc(db*) returns the set of databases db such that

- the values assigned to the data identifiers and view identifiers are consistent,

- $db(i) = \emptyset$ if $db^*(i) = False$.

empty* q db* returns True if db*(q) is False. Otherwise it returns False.

To define exec* we need to distinguish between view identifiers correspond- 
ing to event queries, condition queries and delta queries, so we assume that 
$ViewId = EqId \cup CqId \cup DqId$. exec* uses a function infer(condition, update)
to infer the new value of a condition query. The infer function returns either 
(a) Unknown if the condition may be true after the update executes, or (b) 
False if the condition can never be true after the update executes. This kind 
of inference corresponds to the determination of arcs in an activation graph [7] 
and so a method such as the propagation algorithm described in [8] can be used 
to determine the effect of updates on conditions (with exponential complexity 
in the worst case). This, of course, is a non-effective method, and so infer will 
in general sometimes return Unknown when it could have returned False (but 
if not, we term it a perfect inference function). 

exec* (insert r q, db*) sets the values of r and $+\Delta r$ to Unknown and the
values of all other delta relations to False. It then updates the values of the view
identifiers by setting to Unknown the values of all event queries of the form $+\Delta r$
and to False the values of all other event queries, using infer to infer the new
values of all the condition queries, and setting to Unknown the values of all the
delta queries:

$$
\begin{aligned}
exec^*\ (insert\ r\ q,\ db^*) = {} & \mathbf{let}\ db_1^* = db^*[\, r \mapsto Unknown,\ +\Delta r \mapsto Unknown,\ -\Delta r \mapsto False, \\
& \qquad \forall s \in BaseRel,\ s \neq r.\ (+\Delta s \mapsto False,\ -\Delta s \mapsto False) \,]\ \mathbf{in} \\
& db_1^*[\, \forall e_i \in EqId,\ eq_i = +\Delta r.\ e_i \mapsto Unknown, \\
& \qquad \forall e_i \in EqId,\ eq_i \neq +\Delta r.\ e_i \mapsto False, \\
& \qquad \forall c_i \in CqId.\ c_i \mapsto infer(cq_i,\ insert\ r\ q), \\
& \qquad \forall d_i \in DqId.\ d_i \mapsto Unknown \,]
\end{aligned}
$$

Actions of the form delete r q are handled similarly. 
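
As an illustration, the insert case of this exec* can be written down almost line by line. The sketch below is not the authors' implementation; it assumes a dict-based abstract state, hypothetical key names "+d_" + s and "-d_" + s for delta relations, a mapping from each event query identifier to the delta relation it tests, and an infer function supplied by the caller.

```python
UNKNOWN, FALSE = "Unknown", "False"


def exec_star_insert(r, q, state, base_rels, event_queries, cond_queries,
                     delta_queries, infer):
    """Two-valued abstract execution of `insert r q`.

    event_queries: view id -> name of the delta relation the event query tests.
    cond_queries:  view id -> condition query (opaque, only passed to `infer`).
    delta_queries: iterable of view ids for delta queries.
    """
    db1 = dict(state)
    for s in base_rels:  # all delta relations become False ...
        db1["+d_" + s] = FALSE
        db1["-d_" + s] = FALSE
    db1[r] = UNKNOWN           # ... except that r and +Delta r become Unknown
    db1["+d_" + r] = UNKNOWN
    result = dict(db1)
    for e, tested_delta in event_queries.items():
        result[e] = UNKNOWN if tested_delta == "+d_" + r else FALSE
    for c, cq in cond_queries.items():
        result[c] = infer(cq, ("insert", r, q))
    for d in delta_queries:
        result[d] = UNKNOWN
    return result


state = {"R": UNKNOWN, "S": UNKNOWN}
new_state = exec_star_insert(
    "R", "q", state, ["R", "S"],
    event_queries={"e1": "+d_R", "e2": "-d_R"},
    cond_queries={"c1": "cq1"},
    delta_queries=["d1"],
    infer=lambda cq, update: UNKNOWN,  # the coarsest safe choice
)
```

An infer that always answers Unknown, as used here, is safe but maximally imprecise; a propagation-based infer would return False more often.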

Theorem 3. execSched* is a safe approximation of execSched, given the above
definitions of D*, conc, exec* and empty*.

The initial abstract database state now maps all identifiers to Unknown, 
apart from event queries that could not have triggered the initial rule and their 
corresponding delta relations, all of which are mapped to False.

Example 2. Consider again the rules defined in Example 1. The following
is a trace of the abstract execution when rule 1 is triggered, assuming a perfect
inference function (cond(i) denotes the condition query of rule i):



Iter   Database state          Schedule
       cond(1)    cond(2)
1      Unknown    Unknown      [action(1)]
2      False      Unknown      [action(2)]
3      Unknown    Unknown      [action(1)]



At iteration 2, rule 1’s condition becomes False, since its action will always
falsify $R \cap T$. At iteration 3 the execution of rule 2’s action makes rule 1’s
condition Unknown again (because only the overall truth values of conditions
are recorded, so infer cannot make use of the fact that $R \cap T$ was previously
False). The third state is thus a repetition of the first, so execSched* is halted
and possible rule non-termination is concluded if rule 1 is the first rule triggered.

Example 3. Suppose now we change the condition of rule 1 to be just $R \cap T$.
The following trace is then obtained when rule 1 is the first rule triggered, again
assuming a perfect inference function:

Iter   Database state          Schedule
       cond(1)    cond(2)
1      Unknown    Unknown      [action(1)]
2      False      Unknown      [action(2)]
3      False      Unknown      [action(1)]



Changing rule 1’s condition means that rule 2’s action can no longer alter
the truth value of this condition. Thus, execSched* terminates at iteration 3,
since rule 1’s condition is False, and definite rule termination is concluded if
rule 1 is the first rule triggered. □
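
The traces of Examples 2 and 3 can be replayed with a few lines of Python. This is a deliberately simplified sketch and not the execSched* of the paper: it assumes a schedule holding a single action, a `triggers` table saying which rule each action triggers, and an `effects` table that stands in for a perfect infer. The entry "same" (an update that cannot alter a condition) is an interpretation suggested by Example 3, and the iteration bound is the simple halting criterion mentioned at the end of Section 4.

```python
UNKNOWN, FALSE, SAME = "Unknown", "False", "same"


def run_abstraction2(effects, triggers, first_rule, max_iters=100):
    """Return (verdict, iteration) for a two-valued abstract execution.

    effects[(i, j)]: effect of rule i's action on rule j's condition
                     (UNKNOWN, FALSE or SAME).
    triggers[i]:     the rule triggered by the event that rule i's action raises.
    """
    conds = {rule: UNKNOWN for rule in triggers}
    rule = first_rule
    seen = set()
    for iteration in range(1, max_iters + 1):
        if conds[rule] == FALSE:  # the scheduled rule can no longer fire
            return "terminates", iteration
        state = (tuple(sorted(conds.items())), rule)
        if state in seen:  # repeating abstract state
            return "may not terminate", iteration
        seen.add(state)
        for other in conds:
            effect = effects[(rule, other)]
            if effect != SAME:
                conds[other] = effect
        rule = triggers[rule]
    return "may not terminate", max_iters  # bound reached


triggers = {1: 2, 2: 1}  # delete R raises -Delta R, insert R raises +Delta R
example2_effects = {(1, 1): FALSE, (1, 2): UNKNOWN,
                    (2, 1): UNKNOWN, (2, 2): UNKNOWN}
example3_effects = {**example2_effects, (2, 1): SAME}

result2 = run_abstraction2(example2_effects, triggers, first_rule=1)
result3 = run_abstraction2(example3_effects, triggers, first_rule=1)
```

With the original condition the third state repeats the first, whereas with the condition $R \cap T$ the third state has rule 1's condition False, matching the two traces above.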

The above infer function uses only the syntax of conditions and actions to 
deduce new values for the event and condition queries. This motivates the devel- 
opment of our third abstraction which uses a more refined inferencing method 
and also more precise information about the initial abstract database state de- 
rived from the current concrete database state i.e. our third abstraction is useful 
for dynamic termination analysis. 

6 Abstraction 3: An Abstraction for Dynamic 
Termination Analysis 

As a third example application of our framework, we now show how to use 
the abstract execution semantics for dynamic, as opposed to static, termination 
analysis. This can give more precision because in a dynamic setting the initial 
database state can be described more accurately rather than just consisting of
unknown information. This is reflected by the inclusion of the value True in the
abstract domain.

D* is now the three-valued domain {True, Unknown, False}, where $\emptyset^*$ =
False. conc(db*) now returns the set of databases db such that

- the values assigned to the data identifiers and view identifiers are consistent,

- $db(i) = \emptyset$ if $db^*(i) = False$,

- $db(i) \neq \emptyset$ if $db^*(i) = True$.

empty* q db* returns True if db*(q) is False. Otherwise it returns False.

exec* calls an inference function infer_delt to compute the new truth value
of the delta queries. This either (a) returns True if a delta query was previously
True and the update to r can never cause deletions from it, or (b) returns False
if the delta query was previously False and the update can never cause insertions
to it, or (c) returns Unknown otherwise. Such logic is implementable by
incremental propagation techniques such as [20,11] which perform query rewriting
to determine if insertions or deletions can have an effect on an expression,
and has polynomial complexity. Once again, this is in general a non-effective
method and the infer_delt function will sometimes have to return Unknown
instead of True or False in order to be safe.

The new value of the condition queries is similarly given by a function
infer_con. If the new value of the delta query for the update is False, then
the value of the condition is unchanged. Otherwise, infer_con either (a) returns
True if the condition was previously True and the update to r can never cause
deletions from it, or (b) returns False if the condition was previously False and
the update can never cause insertions to it, or (c) returns Unknown otherwise.
The new value of the event queries is given by a function infer_ev which sets
the value of an event query to the value of its corresponding delta relation.

exec* (insert r q, db*) infers new values for all the delta queries, sets the
value of $+\Delta r$ to that of the delta query corresponding to the action insert r q
(this query is denoted by dq below), sets the value of $-\Delta r$ to False, sets the
value of all other delta relations to Unknown, and infers new values for the view
identifiers corresponding to event queries and condition queries:

$$
\begin{aligned}
exec^*\ (insert\ r\ q,\ db^*) = {} & \mathbf{let}\ db_1^* = db^*[\, \forall d_i \in DqId.\ d_i \mapsto infer\_delt(db^*\ d_i,\ insert\ r) \,]\ \mathbf{in} \\
& \mathbf{let}\ db_2^* = db_1^*[\, +\Delta r \mapsto db_1^*\ dq,\ -\Delta r \mapsto False, \\
& \qquad \forall s \in BaseRel,\ s \neq r.\ (-\Delta s \mapsto Unknown,\ +\Delta s \mapsto Unknown) \,]\ \mathbf{in} \\
& db_2^*[\, \forall e_i \in EqId.\ e_i \mapsto infer\_ev(db_2^*), \\
& \qquad \forall c_i \in CqId.\ c_i \mapsto infer\_con(db_2^*\ c_i,\ db_2^*\ dq,\ insert\ r) \,]
\end{aligned}
$$

Actions of the form delete r q are handled similarly. 
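
The case analysis behind infer_delt and infer_con is small enough to write out. In the Python sketch below, the two Boolean arguments are an assumption made for illustration: they stand for the answers of whatever propagation analysis decides whether the update may cause deletions from, or insertions to, the query in question. Passing True for either of them is always the safe choice.

```python
TRUE, UNKNOWN, FALSE = "True", "Unknown", "False"


def infer_delt(previous, may_delete, may_insert):
    """New three-valued truth value of a delta query after an update."""
    if previous == TRUE and not may_delete:
        return TRUE    # case (a): was non-empty, nothing can be removed
    if previous == FALSE and not may_insert:
        return FALSE   # case (b): was empty, nothing can be added
    return UNKNOWN     # case (c)


def infer_con(previous, delta_value, may_delete, may_insert):
    """New truth value of a condition query, given the update's delta query value."""
    if delta_value == FALSE:
        return previous  # the update changes nothing, so the condition is unchanged
    return infer_delt(previous, may_delete, may_insert)  # same cases (a)-(c)
```

For instance, a condition that was True stays True under an update that can only insert, and a condition whose update has an empty delta keeps its previous value whatever the analysis says.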

Theorem 4. execSched* is a safe approximation of execSched, given the above
definitions of D*, conc, exec* and empty*.

The initial abstract database state is the same as for Abstraction 2, except
that (a) event queries that could have triggered the initial rule, and their associated
delta relations, are now True instead of Unknown, and (b) the values of the
condition and delta queries will be one of {True, False, Unknown} depending
on the concrete run-time state when abstract execution is invoked.

Example 4. Consider again the rules 1 and 2 as originally defined in Example
1. For an initial database state in which rule 1’s condition is known to be True
and rule 2’s to be False, we obtain the following trace, assuming a perfect
inference function:

[Trace table lost in the source; the surviving cells are the column header cond(1) and the schedule entries [action(1)] and [action(2)].]



Thus execSched* terminates at iteration 2 because of the falsity of rule 2’s 
condition. If, however, this was initially True, the trace would be as shown 
below. The last state is now a repetition of the second and so possible rule 
non-termination is concluded:

[Trace table lost in the source.]

7 Discussion and Comparison of Abstractions 1-3

The termination analysis directly provided by execSched* is for a given set 
of rules and a given initial schedule. For example, in Examples 1-4 the ini- 
tial schedule consisted of the action of one of the rules. A broader question is, 
will a given set of rules terminate for any finite initial schedule? For reasons of 
space we cannot give an in-depth treatment of this issue here and will do so 
in a forthcoming paper. However, briefly, if all rules have Immediate coupling 
mode then it is sufficient to run execSched* on each possible singleton sched- 
ule i.e. on two schedules [insert R /] and [delete R /] for each base relation R 
in the database; execSched* terminates on all finite schedules if and only if it 
terminates on all singleton schedules. If some rules may have Deferred coupling
mode, let execSched*′ be execSched* modified so as to reinitialise the abstract
database state at the start of each deferred sub-transaction; then, execSched*
terminates on all finite schedules if execSched*′ terminates on all singleton
schedules. Unlike with Immediate-only coupling mode, this test is safe but may
not be precise. Indeed, the general undecidability of termination of queue schedules
[4] means that a precise test cannot be devised.

We now draw some comparisons between Abstractions 1-3. In general, none
of them subsumes any of the others in the sense that all three can detect cases of 
definite termination that the other two cannot. Abst 3 is not directly comparable 
to Absts 1 and 2 since it will have available information about the initial database 
state. Abst 1 and 2 are distinct in that they use different satisfiability tests 
and so the resulting behaviour of empty* is different for each. With Abst 1, 
exec* maintains a precise record of the execution history (thus requirement (i) 
of Theorem 1 is actually an equality) and all the approximation is carried out 
by the empty* function. Limiting the abstract domain in order to be able to 
detect repeating states, as was done for Absts 2 and 3, moves the approximation 
functionality from empty* to exec* . 

We note that our framework provides a useful basis for determining the 
relative costs of different analysis methods, as these will only differ in their 
exec* and empty* functions. The table below summarises the worst case costs 
for Absts 1-3. The costs of empty* for Abst 1 and exec* for Abst 2 are both 
exponential in the size of the expressions input to them. However, the size of 
expressions input to the latter is constant, whereas for the former the expression 
size grows at each iteration. Thus, Abst 1 is likely to become more expensive in 
practice. We thus envisage Abst 2 being applied in scenarios where the rule set 
is very large (and hence a cheap analysis technique is required) whereas Abst 1 
is more suitable for a deeper analysis of a small rule set. Its low computational 
cost makes Abst 3 suitable for run-time termination analysis.

8 Related Work

Previous methods of active rule analysis have often been closely linked to specific 
rule languages. In contrast, the framework that we have presented here does not 
assume any particular rule definition language. Our rule execution semantics 
can simulate most standard features of active database systems and prototypes. 
A detailed comparison is beyond the scope of this paper, but features such as 
SQL3’s statement-level and row-level AFTER triggers [13], and variations of 
event consumption and rule coupling modes are either directly available or can 
be simulated within the framework. An alternative to the functional specification 
approach we have adopted would be to use while or $while_N$ programs [18] for
the concrete execution semantics, and derive abstract versions of these. The 
advantage of a functional approach is that parameters such as rule language 
and scheduling strategy are explicit. The advantage of while programs would be 
easier identification of rule cycles and monotonicity of relations.

In [14], some general ideas regarding approximation in databases are discussed,
motivated by the classification of methods for incompleteness of information.
Other related work [9] looks at approximate query evaluation in the
context of what precision is theoretically possible when approximating Datalog 
by conjunctive or first-order queries. Our work is complementary to [14,9] in that 
we have developed a framework designed specifically for active rule analysis and 
we have developed three example abstractions for a first-order language. These 
abstractions serve to illustrate the application and usefulness of our framework, 
as well as being useful heuristics for rule termination analysis in their own right. 

Regarding other work in rule termination analysis, [12] proposed using sat- 
isfiability tests between pairs of rules to refine triggering graphs. This can be 
interpreted as a less sophisticated version of the empty* function of our Abst 
1 which does satisfiability testing for sequences of rule conditions and actions. 
Thus, in Example 1, if we only considered the relationship between pairs of 
rules, we would derive the information that rules 1 and 2 can mutually trigger 
each other. By itself, this information does not allow us to make any conclusions 
about termination behaviour. However, Example 1 demonstrated that by doing 
satisfiability testing on an accumulated “history” expression, it can be concluded 
that rule execution must terminate if rule 1 is the initially triggered rule. 

Abst 2 does focus on pairwise relationships between rules and so is similar to 
graph-based analysis techniques based on the notions of triggering and activating 
rules [2,8,7,12]. Abst 2 is not identical to these, however, since in their pure form 
graph-based techniques do not “execute” the rule set and so have no notion of 
control flow. This can cause loss of precision for rule sets where the order of 
execution prohibits certain activation states of conditions. Extensions to these 
techniques are certainly possible to recognise such situations, but do not arise 
naturally from the specifications of the techniques themselves. 

Using abstract interpretation for termination analysis of active rules was first 
considered in [3], for a theoretical rule language manipulating variables rather 
than relations and for a less expressive execution semantics. In [6] we refined 
that work to a full relational language and explored Abst 1 for the PEL active 
database system [21,19] i.e. assuming a specific updateSched function. 

Of course an alternative to developing techniques for rule termination analy- 
sis is to design rule languages which a priori cannot give rise to non-terminating 
computations e.g. [25,15,24]. However, the current SQL3 proposal [13] allows
non-terminating rule sets to be defined. Thus we believe that developing techniques
for rule termination analysis is both necessary and relevant to current practice. 

9 Summary and Further Work 

We have defined a framework for the abstract execution of active rules, and have 
illustrated its application by developing and showing the correctness of three 
techniques for rule termination analysis. Applying the framework is a matter of 
addressing points (i)-(vi) in Section 3 and proving that the two requirements 
of Theorem 1 hold. We envisage the framework as a useful general tool for
the development, comparison and verification of abstraction-based analysis and
optimisation methods for active databases. Of the three abstractions that we 
have presented here, the first provides a detailed static analysis, the second a 
cheaper static analysis and the third a cheap dynamic analysis. 

We are currently implementing a rule analysis tool that supports our frame- 
work. Avenues for further work include: classifying other active rule analy- 
sis methods using our framework, and thereby verifying their correctness and 
domain of applicability; extending the framework to allow abstraction at different
granularities of database objects, abstraction over schedules, and non-deterministic
firings of rules with the same priority.

1 Introduction

Currently, there is tremendous interest in semi-structured (SS) data management.
This is spurred by data sources, such as ACeDB [29], that are inherently
less rigidly structured than traditional DBMS, by WWW documents
where no hard rules or constraints are imposed and “anything goes,” and by in- 
tegration of information coming from disparate sources exhibiting considerable 
differences in the way they structure information. Significant strides have been 
made in the development of data models and query languages [2,11,17,6,7], and 
to some extent, the theory of queries on semi-structured data [1,23,3,13,9]. The 
OEM model of the Stanford TSIMMIS project [2] (equivalently, its variant, in- 
dependently developed at U. Penn. [11]) has emerged as the de facto standard 
model for semi-structured data. OEM is a light-weight object model, which un- 
like the ODMG model that it extends, does not impose the latter’s rigid type 
constraints. Both OEM and the Penn model essentially correspond to labeled 
digraphs. A main theme emerging from the popular query languages such as 
Lorel [2], UnQL [11], StruQL [17], WebOQL [6], and the Ulixes/Penelope pair 
of the ADM model [7], is that navigation is considered an integral and essential 
part of querying. Indeed, given the lack of rigid schema of semi-structured data, 
navigation brings many benefits, including the ability to retrieve data regardless 
of the depth at which it resides in a tree (e.g., see [4]). This is achieved with 
programming primitives such as regular path expressions and wildcards. A sec- 
ond, somewhat subtle, aspect of the emerging trend is that query expressions 
are often dependent on the particular instance they are applied to. This is not 
surprising, given the lack of rigid structure and the absence of the notion of a 
predefined schema for semi-structured data. In fact, it has been argued [4] that 
it is unreasonable to impose a predefined schema. 

There are two problems with the current state of affairs. First, while classi- 
cal database theory tells us that queries are computable and generic functions 
from database instances over a predefined schema to (say) relations over an- 
other schema, the state-of-the-art theory of semi-structured databases (SSDB) 
does not exactly pin down the input/output type of queries. Indeed, it is unclear
what it means to query SSDBs in a formal sense. Second, the notion of
genericity [16] plays a pivotal role in the theory of queries. We contend that this
notion has not been satisfactorily addressed for semi-structured data. Genericity 
says that the order of columns or rows in a relation is irrelevant for purposes 
of querying. In its purest form, it also says that the domain values are unin- 
terpreted. One may argue that the structure present in a SSDB, in the form 
of parent-child relationships and/or left-right ordering, carries inherent informa- 
tion, that should be regarded on a par with the data for purposes of querying. 
However, drawing lessons from the analogy to the classical notion of genericity, it
stands to reason that if the same underlying data in one SSDB is modeled
differently using different structure in a second SSDB, queries should
be insensitive to the difference in representation between the two SSDBs. We
call this property representation independence. These notions are made precise
in Section 3. Here we give some examples to illustrate the underlying intuition. 
We use the OEM model as a representative SS data model and the fragment 
of the StruQL language presented in [18] as a representative query language for 
convenience and conciseness. Our observations hold for any of the prominent 
models/query languages mentioned above. 



Fig. 1. A Semi-structured Restaurant Guide



Consider the OEM SSDB in Figure 1, showing (partial) details about restaurants
in Canada. It shows information about gourmet and fastfood restaurants
and has some heterogeneity in the way the information is represented. This could 
be the result of integration of information from different sources. Now, consider 
the query “find pairs of cities and restaurant names in them.” One way to express 
this query in the StruQL fragment is to use two rules:

$r_1$ : q(X,Y) ← Guide (Gourmet.*) Z, Z (City) X, Z (*.Name) Y.

$r_2$ : q(X,Y) ← Guide (Fastfood.*) Z, Z (Name) Y, Z (*.City) X.

Here, X, Y, Z are node variables, and Guide is the entry point (for the root).
To write this query expression, one needs to know that gourmet restaurants are
organized with a different structure than the fastfood ones. If all information
had been organized following the gourmet structure, we would have written the
query using just rule $r_1$. If this rule were applied to an “equivalent” database
storing the same information using the fastfood structure (for which $r_2$ would
be the right rule), it would not yield the expected answer. Thus, intuitively, none
of these expressions is representation independent. One way out of this might
be to write the one-rule expression

$r_3$ : q(X,Y) ← Guide (*) Z, Z (*.City) X, Z (*.Name) Y.

Since the wildcard * can be mapped to the empty label, this seems to be able
to find all city/restaurant pairs in a uniform way, independent of the structural
differences. Unfortunately, this does not work when $r_3$ is applied to the database
of Figure 1: one of the valid substitutions would map Z to the left child of the
root, which allows the last two path expressions in the body of $r_3$ to be mapped
to the paths corresponding to the city Montreal and the restaurant Sushishop.
This produces the pair (Montreal, Sushishop), an obviously incorrect answer.
It is not clear whether one can write a fixed query expression in this language
(or, for that matter, in any of the languages mentioned above) that can correctly
find all city/restaurant pairs in a representation independent way.^
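
The failure of the wildcard rule can be reproduced on a toy tree. The Python sketch below is an illustration only: the nested-dict encoding of an OEM tree, the restaurant name "Bistro" and the exact shape of the data are hypothetical and are not taken from Figure 1. A node is a dict from edge labels to lists of children, and strings stand for atomic nodes.

```python
def descendants(node):
    """Yield `node` and every node below it, i.e. all targets of the path *."""
    yield node
    if isinstance(node, dict):
        for children in node.values():
            for child in children:
                yield from descendants(child)


def follow(node, label):
    """Children of `node` reached over a single edge labelled `label`."""
    return node.get(label, []) if isinstance(node, dict) else []


def wildcard_rule(guide):
    """q(X,Y) <- Guide (*) Z, Z (*.City) X, Z (*.Name) Y."""
    pairs = set()
    for z in descendants(guide):                                         # Guide (*) Z
        cities = [x for d in descendants(z) for x in follow(d, "City")]  # Z (*.City) X
        names = [y for d in descendants(z) for y in follow(d, "Name")]   # Z (*.Name) Y
        pairs.update((x, y) for x in cities for y in names)
    return pairs


# Hypothetical data: two restaurants below the same Gourmet node.
guide = {
    "Gourmet": [{
        "Restaurant": [
            {"City": ["Montreal"], "Name": ["Bistro"]},
            {"City": ["Toronto"], "Name": ["Sushishop"]},
        ],
    }],
}
result = wildcard_rule(guide)
```

Because Z may be bound to any common ancestor of a City node and a Name node, the result mixes attributes of different restaurants and contains the spurious pair ("Montreal", "Sushishop") alongside the intended ones.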

A possible objection against representation independence would be the fol- 
lowing. Consider two nested relational databases X and J containing “equiv- 
alent” information on restaurants and cities, where X groups the information 
by city, whereas J groups by restaurant names. A query such as “find all 
city/restaurant pairs in flat form” would require two different expressions against
X and J in nested relational algebra. If this is acceptable, isn’t it unreasonable 
to impose the requirement of representation independence on SSDB query lan- 
guages? The difference is that the nested relational model, like all traditional 
data models, has a well-defined notion of a predefined schema. Thus, the user 
can be reasonably expected to know the schema in full, which includes, among 
other things, structural information such as nesting order. Thus, we can live with 
representation dependence. By contrast, as argued by Abiteboul [4], it is unrea- 
sonable to assume a predefined schema or a complete structural/representational 
knowledge on the part of the user, for SSDBs. In this context, we believe that 
representation independence is not only reasonable, but is an extremely impor- 
tant requirement for a SSDB query language to be practical. We can trace the 
representation dependent nature of most prominent languages to their heavy 
reliance on navigation (e.g., via path expressions). Following Ullman [30], we 
contend that a high-level query language should enable the user to write queries 
without having to worry about navigational details. 

^ It is even less obvious whether representation independent expressions exist in these 
languages for all queries, since the issue is not addressed in those papers.

Another source of bewilderment in writing SSDB queries is that the propositions
that are supposed to be expressed by a database are not clearly captured
by existing SS data models, as these models are largely syntactic. In a traditional 
model like the relational model, the facts are the tuples in the various relations. 
Similarly, the fact structure for other traditional models is precisely captured by 
the data model. In the case of SSDBs, we note from the literature that the facts 
can sometimes be in the paths [28,6] and sometimes in certain subtrees (as in the 
example above; also see [17,11]). The consequence of the model not capturing 
the fact structure is that the burden of finding it is shifted to the query expres- 
sions the user has to write. Of course, here a possible objection would claim 
that there is no fact structure - that is why the database is “semi-structured.” 
Since there is no fact structure, navigation-based querying is as good as any 
method. But if the user is able to navigate, he/she has to know the “schema” 
and since the “schema” is unknown a priori, the user must have seen the data 
in order to have seen the “schema.” In this case, a navigational query seems to 
offer little additional advantage over further browsing by the user. And if the 
“schema”, however primitive, is known a priori, then the database does express 
some propositions after all. 

Our work was motivated by the following questions. Can we make a self- 
describing light-weight model such as OEM also capture the fact structure in a 
SSDB? What is an appropriate notion of genericity, and what is an appropriate 
definition of a query for SSDBs, in a formal sense? Can we design SSDB query 
languages that are representation independent and are “generic?” These are 
crucial questions that need to be answered before useful query languages can be 
designed and implemented, before the goal of repository independence [21] can be 
achieved, and before advances in areas such as expressive power and completeness 
can be made. Our main contributions can be summarized as follows: 

1. We formalize the notion of representation independence and show that it 
is an important property for a SSDB query language. We also introduce 
a notion of representation preservation which intuitively says that queries 
should perturb existing structure as little as possible.²

2. We take the core intuitions behind the OEM model and extend the model 
with a few minor amendments, while preserving its light-weight nature. We 
make precise the fact structure of a SSDB and give a formal definition of the 
contents of a SSDB. 

3. We define the concept of genericity for SSDBs. Roughly speaking, genericity
means that queries should be representation independent and should not
interpret domain values. We give two possible notions of schema - a conser- 
vative predefined schema similar in spirit to that of traditional data models, 
and a liberal schema, which is essentially a finite set of attributes, not known 
in advance. We also extend the classical notion of BP-completeness [27,8] 
to the SSDB setting. 

² This is not an argument against restructuring. On the contrary, we contend that
restructuring should happen by design, not “accidentally” as a result of querying.
4. We define a calculus called component calculus (CC) and an algebra called
component algebra (CA) and show that: 

(a) CC and CA are equivalent in expressive power. 

(b) CC is representation independent but not representation preserving, 
while CA is both. 

(c) CC and CA are BP-complete in our extended sense. 

(d) Comparing CC with TRC$^{\bot}$, an extension of the classical tuple
relational calculus (TRC) that accommodates inapplicable nulls, it turns out
that there are queries expressible in CC for which there is no equivalent
expression in TRC$^{\bot}$ whose size is polynomially bounded in the size of
the conservative schema. When the schema is liberal, there are queries
expressible in CC which have no equivalent expressions in TRC$^{\bot}$.

2 Semi-structured Databases 

2.1 Schema and Instance 

Let $\mathcal{N}$, $\mathcal{A}$, and $\mathcal{V}$ be countable pairwise disjoint sets of nodes $\{v, u, w, u_i, v_j, \ldots\}$,
attribute names $\{A_1, \ldots, A_n, \ldots\}$, and values $\{a, b, c, \ldots\}$. In general, we may
know the “schema” of a semi-structured database either completely, partially, 
or not at all. However, in practice, when dealing with such databases, based 
on the application domain, it is sometimes reasonable to assume we at least 
know what kind of attributes to expect, if not how/where to find them. To 
reflect this, we use two alternative notions of schema in this paper. In the most 
general scenario, the schema of a semi-structured database can be completely 
instance dependent, and thus can potentially be any finite subset of $\mathcal{A}$, which
is not known a priori. This is called the liberal schema, and is perhaps the most 
realistic notion. A second, restrictive notion is that of a conservative schema, 
which is a predetermined finite subset $A \subset \mathcal{A}$ of attributes. This notion may
be appropriate when dealing with highly regular databases such as bibtex
databases (possibly in a fixed source such as [25]), sets of machine-generated
homepages of employees in an organization, etc.

A database tree is a 5-tuple $T = (V, E, attr, val, label)$, such that:

- $V \subseteq \mathcal{N}$ is a finite set of nodes.

- $E \subseteq V \times V$ is a set of edges such that the graph $(V, E)$ is a rooted tree.

- $attr : V \to \mathcal{A}$ is a partial function that maps each node $v \in V$ to its attribute
$attr(v)$. When $attr(v)$ is undefined, we write $attr(v) = \bot$. We refer to a
node $u$ such that $attr(u) = \bot$ as a null node.

- $val : V \to \mathcal{V}$ is a partial function that maps each node $v \in V$ to its value
$val(v)$. When $val(v)$ is undefined, we write $val(v) = \bot$. We assume that if
$attr(v) = \bot$, then necessarily $val(v) = \bot$.

- $label$ maps each node to a subset of labels from $\{\ominus, \obar\}$.

Sometimes we will assume that the trees are ordered. For this we use a
function $\omega$ that associates with each node in $V$ a total order on its outgoing
edges (see e.g. Beeri and Milo [9]). In essence then, a database tree is an ordered
rooted tree whose nodes come with an attribute-value pair $(A : a)$, together
with optional labels $\ominus$, $\obar$. Either the value $a$, or both the attribute $A$ and the
value $a$, could be null. The labels $\ominus$, $\obar$ are called blockers, and they keep the
query processor from mixing up associations between different attribute-value
pairs, thus capturing the inherent fact structure. Conceptually, this corresponds
to transferring the navigational knowledge imposed on the user by navigational
languages to the data model and the source that implements it. Figure 2 shows
three database trees.
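The 5-tuple definition above can be made concrete with a small data structure. The following is a minimal sketch, not the authors' implementation: `None` stands in for the null, integer node identifiers and the string names `HORIZONTAL` and `VERTICAL` for the two blockers are assumptions of this sketch, and the ordering of children is taken to be the order in which edges are listed.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical encodings of the two blocker labels (not the paper's notation).
HORIZONTAL = "horizontal"
VERTICAL = "vertical"


@dataclass
class DatabaseTree:
    """T = (V, E, attr, val, label); None plays the role of the null."""
    nodes: List[int]
    edges: List[Tuple[int, int]]  # (parent, child) pairs
    attr: Dict[int, Optional[str]] = field(default_factory=dict)
    val: Dict[int, Optional[str]] = field(default_factory=dict)
    label: Dict[int, FrozenSet[str]] = field(default_factory=dict)

    def is_well_formed(self) -> bool:
        """Check that (V, E) is a rooted tree and that attr = null forces val = null."""
        node_set = set(self.nodes)
        parent: Dict[int, int] = {}
        for p, c in self.edges:
            # Edges must stay inside V and give each node at most one parent.
            if p not in node_set or c not in node_set or c in parent:
                return False
            parent[c] = p
        if len(node_set) - len(parent) != 1:  # exactly one root
            return False
        for start in node_set:  # walking upwards must never revisit a node
            seen = set()
            v = start
            while v in parent:
                if v in seen:
                    return False
                seen.add(v)
                v = parent[v]
        for v in node_set:
            if self.attr.get(v) is None and self.val.get(v) is not None:
                return False
            if not self.label.get(v, frozenset()) <= {HORIZONTAL, VERTICAL}:
                return False
        return True


# A null root carrying one blocker, with children (A : a) and (B : b).
example = DatabaseTree(
    nodes=[0, 1, 2],
    edges=[(0, 1), (0, 2)],
    attr={0: None, 1: "A", 2: "B"},
    val={0: None, 1: "a", 2: "b"},
    label={0: frozenset({VERTICAL})},
)
```

Calling `example.is_well_formed()` checks only the structural conditions of the definition; it says nothing about facts or components, which depend on the path restrictions introduced later.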

Informally, a fact of a database tree $T$ is any set $\{(A_1 : a_1), \ldots, (A_n : a_n)\}$
of attribute-value pairs from nodes in a subtree of $T$, such that no attribute is
repeated and the subtree does not cross any of the blockers $\ominus$ or $\obar$. Some of
the values $a_i$ might be the null $\bot$. Intuitively, the fact $\{(A_1 : a_1), \ldots, (A_n : a_n)\}$
represents the tuple $[A_1 : a_1, \ldots, A_n : a_n]$. For instance, in Figure 2, the tree $T_1$
contains the facts $\{(A : a), (B : b), (C : \bot), (D : \bot)\}$ and $\{(A : a'), (B : b'), (D : \bot)\}$,
while $T_2$ contains the facts $\{(A : a), (B : b), (E : \bot)\}$ and $\{(A : a'), (B : b'), (C : \bot)\}$.
Note that the fact $\{(A : a), (B : b), (C : \bot), (D : \bot)\}$ in $T_1$ is
“subsumed” by the fact $\{(A : a), (B : b)\}$. Tree $T_3$ in Figure 2 contains the facts
$\{(A' : b), (B' : a), (F : \bot)\}$ and $\{(A' : b'), (B' : a'), (F : \bot)\}$. Intuitively, all three
trees are “equivalent” in terms of the set of facts they represent. This intuition is
captured in Definition 1, where “content equivalence” captures equivalence w.r.t.
the underlying set of facts, and the other notions of equivalence there capture
the various dissimilarities among the trees of Figure 2.




[Figure: database trees $T_1$, $T_2$, and $T_3$]

Fig. 2. Three Database Trees



We can see from the example that $\obar$ may be regarded as a kind of “light-weight”
set node.³ The role of $\ominus$ is to act as a scoping boundary for attribute
names. For instance, if Name denotes the name of an employee, and the children’s
names of the employee are encoded in a subtree, we can use the blocker $\ominus$ at
the root of the subtree to bound the scope of the two different occurrences of the
attribute Name. A tree encoding the facts that Joe is an employee with address

³ Notice that the model permits heterogeneous sets in that different tuples can have
different sets of attributes.
Westmount, and that he has two children, Sam and Brigitte, aged 5 and 3
respectively, is shown in Figure 3. The nodes with $\bot$ as values can be thought of as
encoding objects. Thus Emp is an object, and Children is a subobject of Emp.
Intuitively, then, use of the blocker $\ominus$ can be viewed as declaring a new object
(corresponding to the subtree) and a reference to the subobject in the parent
object.




A semi-structured database instance (SSDB) is a finite sequence of database
trees $\mathcal{I} = (T_1, \ldots, T_n)$. An example database appears in Figure 4 in Section 3.3.
When we consider instances of a predefined schema $A$, we will assume that the
range of the function $attr$ is included in $A$. The active domain of a SSDB $\mathcal{I}$,
denoted $adom(\mathcal{I})$, is the set of all values associated with nodes in $\mathcal{I}$’s trees.

2.2 Contents of Databases 

It is clear that the same information can be represented as a semi-structured 
database in several “equivalent” ways. In this section, we formalize four notions 
of equivalence between SSDBs. Below, we informally explain these notions, first 
between trees. 

— Content equivalence: Although two trees might have different “looks” they 
nevertheless represent the same “generic” information, where attribute 
names and constants are not interpreted. 

— Subsumption equivalence (special case of content equivalence): The two trees, 
in addition, use the same constant and attribute names in the same contexts. 

— Partial isomorphism equivalence: Full isomorphism would mean that the trees
are isomorphic as labeled ordered trees and have the same generic information
represented in structurally similar ways. However, since we will choose
to disregard nodes with null values $\bot$ in the isomorphism, we work with
partial isomorphism.

— Partial identity (special case of partial isomorphism equivalence): The trees
are identical, except for nodes with null values $\bot$.

We next formalize these notions. 

A path in a database tree $T$ is a sequence $v_1, v_2, \ldots, v_k$ of nodes related
through $E \cup E^{-1}$, subject to the following restrictions: (i) whenever $v_i, v_j, v_m$
are three consecutive nodes on the path, and $(v_i, v_j) \in E^{-1}$ and $(v_j, v_m) \in E$,
then $\obar \notin label(v_j)$; and (ii) if $v_i, v_j, v_m$ are three consecutive nodes on the path,
and $(v_i, v_j), (v_j, v_m) \in E$, then $\ominus \notin label(v_j)$. In tree $T_1$, Figure 2, the sequences
of nodes corresponding to $(A : a), (C : \bot), (B : b)$ and that corresponding to
$(D : \bot), (C : \bot), (A : a)$ are both paths. A path is simple provided no node
repeats along the path. Nodes $v$ and $w$ are connected if there is a simple path
between them. A set of nodes is connected if every pair of nodes in the set is
connected (by a simple path). A connected component (component, for short) of
a tree T is a subtree of T induced by a maximal connected subset of nodes of T. 
Note that unlike in classical graph theory, two components can overlap. The set 
of attribute-value pairs associated with the nodes of any component of T, such 
that no attribute repeats, is a fact in T. More than one fact may thus come from 
the same component (if it contains repeated attributes). Thus, the information 
content of T is the set of facts in T. 

Let $T = (V, E, attr, val, label)$ and $T' = (V', E', attr', val', label')$ be
database trees, and let $U = \{v \in V : val(v) \neq \bot\}$. A mapping $h : U \to V'$
is called a tree morphism if for all nodes $v_i$ and $v_j$ in $U$, the following holds: (i) if
$attr(v_i) = attr(v_j)$ then $attr'(h(v_i)) = attr'(h(v_j))$; (ii) if $val(v_i) = val(v_j)$ then
$val'(h(v_i)) = val'(h(v_j))$. We sometimes denote a tree morphism as $h : T \to T'$.
Basically, tree morphisms are consistent w.r.t. the attributes and values associated
with nodes.

Let $\mathcal{A}(T)$ (resp., $\mathcal{V}(T)$) denote the set of attribute names (resp., values) of
the nodes in $T$. Then a tree morphism $h : T \to T'$ induces the (partial) attribute
mapping $h_{attr} : \mathcal{A}(T) \to \mathcal{A}(T')$, defined as $h_{attr}(A) = attr'(h(v))$, where $v$ is any
node in $V$ for which $attr(v) = A$. Similarly, it induces a value mapping
$h_{val} : \mathcal{V}(T) \to \mathcal{V}(T')$, defined as $h_{val}(a) = val'(h(v))$, where $v$ is any node in $V$ for which
$val(v) = a$. Clearly, these mappings are well defined.
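The two tree-morphism conditions amount to saying that the induced attribute and value mappings are single-valued. The sketch below illustrates this under stated assumptions: trees are represented only by their `attr` and `val` dictionaries (with `None` as the null), `h` is a dictionary on the nodes of $T$ with non-null values, and the function name `induced_mappings` is hypothetical.

```python
from typing import Dict, Hashable, Optional, Tuple

Node = Hashable


def induced_mappings(
    attr: Dict[Node, Optional[str]],
    val: Dict[Node, Optional[str]],
    attr2: Dict[Node, Optional[str]],
    val2: Dict[Node, Optional[str]],
    h: Dict[Node, Node],
) -> Optional[Tuple[dict, dict]]:
    """Return (h_attr, h_val) if h is a tree morphism from T to T', else None."""
    # U: the nodes of T whose value is defined; h must be defined exactly on U.
    domain = {v for v, a in val.items() if a is not None}
    if set(h) != domain:
        return None
    h_attr: dict = {}
    h_val: dict = {}
    for v in domain:
        target = h[v]
        # Condition (i): nodes with equal attributes map to nodes with equal attributes.
        if h_attr.setdefault(attr[v], attr2.get(target)) != attr2.get(target):
            return None
        # Condition (ii): nodes with equal values map to nodes with equal values.
        if h_val.setdefault(val[v], val2.get(target)) != val2.get(target):
            return None
    return h_attr, h_val


# T has nodes 1 (A : a), 2 (B : b) and 3 (A : null); T' has x (A' : b'), y (B' : a').
attr = {1: "A", 2: "B", 3: "A"}
val = {1: "a", 2: "b", 3: None}
attr2 = {"x": "A'", "y": "B'"}
val2 = {"x": "b'", "y": "a'"}
mappings = induced_mappings(attr, val, attr2, val2, {1: "x", 2: "y"})
```

The sketch checks only the morphism conditions; connectivity and injectivity requirements are not covered here.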

Definition 1. A tree morphism $h : T \to T'$ is called a content embedding if the
following conditions hold: (i) the attribute mapping $h_{attr}$ induced by $h$ is injective;
(ii) the value mapping $h_{val}$ induced by $h$ is injective; (iii) whenever $v_i$
and $v_j$ are connected in $T$, then $h(v_i)$ and $h(v_j)$ are connected in $T'$. □

If there is a content embedding from $T$ to $T'$, we say that $T$ is content-homomorphic
to $T'$ and write $T \le_{cont} T'$. If $T$ and $T'$ are content-homomorphic
to each other, we say they are content equivalent, i.e. $T =_{cont} T'$ iff $T \le_{cont} T'$ and
$T' \le_{cont} T$. We will also be interested in special content embeddings $h$, which are
content embeddings for which the induced mappings $h_{attr}$ and $h_{val}$ are identity
mappings. In case there is a content embedding from $T$ to $T'$ via a special
mapping $h : T \to T'$, we say that $T$ is subsumed by $T'$ and write $T \le_{sub} T'$. If $T$
and $T'$ are subsumed by each other, we say they are subsumption equivalent, i.e.
$T =_{sub} T'$ iff $T \le_{sub} T'$ and $T' \le_{sub} T$.

For nodes $u$, $v$ in a tree $T$, $u$ preorder precedes $v$ provided $u$ comes before $v$
in the preorder enumeration of $T$’s nodes. It is extended to a SSDB instance $\mathcal{I}$
in the obvious way: $u$ preorder precedes $v$ whenever both come from the same
tree and the above condition holds, or $u$ ($v$) comes from $T_i$ ($T_j$) and $T_i$ comes
before $T_j$ in the standard enumeration of the trees in $\mathcal{I}$.
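The preorder-precedes relation on an instance can be computed by ranking every node with a pair (tree index, preorder position). This is a minimal sketch; the representation of a tree as a `(root, children)` pair, the function names, and the assumption that node identifiers are unique across the trees of an instance are choices made here for illustration.

```python
from typing import Dict, List, Tuple

Tree = Tuple[int, Dict[int, List[int]]]  # (root, ordered children lists)


def preorder(root: int, children: Dict[int, List[int]]) -> List[int]:
    """Preorder enumeration of an ordered tree, without recursion."""
    order: List[int] = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push children right-to-left so the leftmost child is visited first.
        stack.extend(reversed(children.get(v, [])))
    return order


def preorder_precedes(u: int, v: int, instance: List[Tree]) -> bool:
    """True if u preorder precedes v in the instance (a sequence of trees)."""
    rank: Dict[int, Tuple[int, int]] = {}
    for tree_index, (root, children) in enumerate(instance):
        for position, node in enumerate(preorder(root, children)):
            rank[node] = (tree_index, position)
    # Tuples compare lexicographically: tree order first, then preorder position.
    return rank[u] < rank[v]


first: Tree = (0, {0: [1, 2], 1: [3]})
second: Tree = (10, {10: [11]})
instance = [first, second]
```

With this encoding, `preorder(0, first[1])` visits node 3 before node 2, and every node of `first` precedes every node of `second`.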

The next notion we introduce is partial isomorphism equivalence. Thereto,
we say that a tree morphism $h : T \to T'$ is a tree embedding, if it is a content
embedding and further, whenever both $u$ and $v$ have non-null values, if $(u, v) \in E$,
then $(h(u), h(v)) \in E'$, and if $u$ preorder precedes $v$ in $T$, then $h(u)$ preorder
precedes $h(v)$ in $T'$. If there is a tree embedding from $T$ to $T'$ we write $T \le_{piso} T'$,
since, loosely speaking, a subtree of $T$ is isomorphic to a subtree of $T'$. If both
$T \le_{piso} T'$ and $T' \le_{piso} T$, then $T$ and $T'$ are partial isomorphism equivalent,
denoted $T =_{piso} T'$. The final notion is partial identity. Suppose there is a tree
embedding $h$ from $T$ to $T'$, such that $h$ is special. Then, informally speaking,
the non-null portion of $T$ is identical to a subtree of $T'$, so we write $T \le_{pid} T'$.
If both $T \le_{pid} T'$ and $T' \le_{pid} T$, then $T$ and $T'$ are partially identical, denoted
$T =_{pid} T'$. The various notions of homomorphism are related as follows.

Lemma 1. (1) If $T \le_{sub} T'$ then $T \le_{cont} T'$. (2) If $T \le_{pid} T'$ then $T \le_{piso} T'$. (3)
If $T \le_{piso} T'$ then $T \le_{cont} T'$. (4) None of the reverse implications holds.

These notions of equivalence are illustrated in Figure 2 from Section 2.1.
For the trees in the figure, we have $T_1 =_{sub} T_2$ and $T_1 =_{cont} T_3$. We also
have $T_2 =_{piso} T_3$. Clearly $T_1 \neq_{piso} T_3$.

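Finally, let $\le_*$ denote any of the notions of embedding above. For two
database instances $\mathcal{I}$, $\mathcal{J}$, we define $\mathcal{I} \le_* \mathcal{J}$ provided for all trees $T \in \mathcal{I}$, there is
a tree $T' \in \mathcal{J}$ such that $T \le_* T'$. $\mathcal{I} =_* \mathcal{J}$ iff $\mathcal{I} \le_* \mathcal{J}$ and $\mathcal{J} \le_* \mathcal{I}$.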

In Section 3.2, in defining operators of our algebra, we will often need to 
create identical copies of trees or components. An identical copy of a database 
tree $T$ is another database tree $T'$ such that there is a special embedding
$h : T \to T'$ which is 1-1 and onto on the nodes of $T$, and preserves node labels. The
following theorem characterizes the complexity of checking the different notions 
of embedding. 

Theorem 1. Deciding whether $T \le_{cont} T'$ is NP-complete. Deciding whether
$\mathcal{I} =_* \mathcal{J}$ can be done in PTIME for $=_* \in \{=_{piso}, =_{sub}, =_{pid}\}$.

3 Querying Semi-structured Databases 

In the classical setting of relational databases, queries are defined as computable 
functions from database instances of a given schema to relations over another 
given schema, that are generic, in the sense of commuting with database iso- 
morphisms. For SSDBs, a complication arises because instances do not come 
with a predefined schema in general. The “right” definition of queries depends 
on what one perceives as the information content of a SSDB. E.g., consider
Figure 4 (Section 3.3), which shows two database trees $T_1$ and $T_2$ representing
the resources and industries in various Canadian provinces and cities. In $T_1$,
one can argue that the fact that provinces come below regions is significant. A
problem with this position, however, is that a different presentation of the information
in Figure 4 might simply list the information for each province in the
form of a set of tuples over region, province, capital, population, and resource.
In a SSDB context, it is unreasonable to assume complete user knowledge of the
exact structural organization of information. We postulate that the right level
of abstraction of information content in a SSDB is the set of associations between
various attribute values as captured by the paths connecting them in the
SSDB. In keeping with this, queries should not distinguish between two SSDBs
that are considered equivalent w.r.t. their contents. Let $\mathcal{I}$ be any SSDB, and $\pi$
any permutation on its active domain $adom(\mathcal{I})$. Then we extend $\pi$ to $\mathcal{I}$ in the
obvious manner: $\pi(\mathcal{I})$ is the database forest obtained by replacing all non-null
attribute values $a$ by $\pi(a)$. Then we have

Definition 2. A query is a computable function $Q$ from the set of SSDBs to the
set of database trees such that: (i) $Q$ is invariant under permutations of the active
domain, i.e. for any instance $\mathcal{I}$ and any permutation $\pi$ on its active domain,
$Q(\pi(\mathcal{I})) =_{pid} \pi(Q(\mathcal{I}))$; and (ii) whenever $\mathcal{I} =_{cont} \mathcal{J}$, then $Q(\mathcal{I}) =_{cont} Q(\mathcal{J})$. □
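Extending a permutation of the active domain to an instance only touches the value function of each tree. The following sketch illustrates that step, assuming an instance is represented as a list of node-to-value dictionaries (one per tree) with `None` as the null; the function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

ValueMap = Dict[int, Optional[str]]  # node -> value, None for the null


def active_domain(instance: List[ValueMap]) -> Set[str]:
    """All non-null values occurring at nodes of the instance's trees."""
    return {a for val in instance for a in val.values() if a is not None}


def apply_permutation(instance: List[ValueMap], pi: Dict[str, str]) -> List[ValueMap]:
    """Replace every non-null value a by pi(a); nulls are left untouched."""
    adom = active_domain(instance)
    # pi must be a bijection of the active domain onto itself.
    if set(pi) != adom or set(pi.values()) != adom:
        raise ValueError("pi is not a permutation of the active domain")
    return [
        {node: (None if a is None else pi[a]) for node, a in val.items()}
        for val in instance
    ]


instance = [{1: "a", 2: None}, {3: "b", 4: "a"}]
swapped = apply_permutation(instance, {"a": "b", "b": "a"})
```

Condition (i) of the definition then asks that running a query on `swapped` gives, up to partial identity, the same tree as permuting the answer on `instance`; that comparison is not implemented in this sketch.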

As illustrated in Section 1, one of the desirable properties for a query language 
for SSDBs is the “robustness” with which it can accommodate modifications 
to the structural presentation of data while the contents are left intact. More 
formally, we have 

Definition 3. A query language $\mathcal{L}$ for SSDBs is representation independent (ri)
provided for every expression $E$ in $\mathcal{L}$, and for any two SSDBs $\mathcal{I}$ and $\mathcal{J}$, $\mathcal{I} =_{cont} \mathcal{J}$
implies $E(\mathcal{I}) =_{cont} E(\mathcal{J})$.⁴ □

A user of a representation independent query language can remain oblivious 
to differences in the actual representation of a given piece of information. None 
of the languages surveyed in Section 1 is representation independent.

A second desirable property of query languages is representation preserva- 
tion. Intuitively, it means the expressions in the language should perturb the 
original structure present among elements of the output, as little as possible. 
We believe queries expressed in a representation preserving language, requiring 
minimal structural alterations, will lead to efficient implementation. We distin- 
guish between querying and restructuring in the sense of [20] and postulate that 
the spirit of pure querying is captured by representation preservation. 

Definition 4. A query language $\mathcal{L}$ is representation preserving (rp) provided
for every expression $E$ in $\mathcal{L}$, and for every instance $\mathcal{I}$, there is a partial isomorphic
embedding of $E(\mathcal{I})$ into $\mathcal{I}$, that is, $E(\mathcal{I}) \le_{piso} \mathcal{I}$. □

⁴ Note that this is stronger than saying that for every query $Q$ expressible in $\mathcal{L}$, there
is an expression $E$ in $\mathcal{L}$ that computes $Q$ and returns equivalent answers on content
equivalent databases.
3.1 Component Calculus

In our model for semi-structured data, we regard the connected components of 
the database forest as the basic information carrying units which define the fact 
structure of the database. In the component calculus, we give primacy to these 
components, by letting the variables range over them. Let V be a countable
set $\{s, t, \ldots\}$ of component variables, or variables for short. To accommodate
inapplicable nulls, which are natural in SSDBs, following Manna [22], we use
two kinds of equalities, namely $=$ and $\doteq$. An atom of component calculus is
an expression of the form⁵ $t.A = a$, $t.A \doteq a$, $t.A \doteq \bot$, $t.A = s.B$, $t.A \doteq s.B$, $t \sqsubseteq s$,
$t[a]$, or $T(t)$, where $T$ is a tree, $t$ and $s$ are variables, $A$ and $B$ are attribute
names, and $a$ is a domain value. The set of all well-formed formulas of component
calculus is formed by closing the atoms under $\wedge$, $\neg$, and $\exists$ in the usual way. We
will use the shorthands: $t.A \neq a$ for $\neg(t.A = a)$, $t.A \not\doteq a$ for $\neg(t.A \doteq a)$, etc.
We will use the special abbreviation $t[A]$ for $t.A \not\doteq \bot$, which in turn abbreviates
$\neg(t.A \doteq \bot)$.

Throughout the paper, when we construct new trees, by a “new” node, we
mean a node that appears nowhere else before. A tree $T$ is said to be multiple-free
w.r.t. a set of attributes $X$ if no attribute in $X$ occurs more than once in $T$.
It is multiple-free if no attribute occurs more than once. Let $T$ be a tree. By
the universe of $T$ we mean the set of all multiple-free line trees, constructed in
such a way that each attribute-value pair in it is equal to some attribute-value
pair in $T$, and the line trees do not contain null nodes. The notion of universe
is extended in the obvious way to a database instance $\mathcal{I}$. Furthermore, let $\varphi$
be a formula in component calculus, and $\mathcal{I}$ an instance. Then the universe of
$(\mathcal{I}, \varphi)$ is defined as the universe of $\mathcal{I}$ except that it additionally has components
obtained as follows: for each line tree $\tau$ in the universe of $\mathcal{I}$, rename any subset
of attributes of $\tau$ with distinct attributes appearing in $\varphi$.

Let $\theta$ be a valuation into $\mathcal{I}$, i.e. a mapping from the variable set V to the
universe of $\mathcal{I}$. We now define satisfaction of a component calculus formula $\varphi$ by
a database $\mathcal{I}$ under a valuation $\theta$, in symbols $\mathcal{I} \models \varphi\theta$.

$\mathcal{I} \models (t.A = a)\theta$, if $attr(v) = A$ and $val(v) = a$, for some node $v$ in $\theta(t)$.

$\mathcal{I} \models (t.A = s.B)\theta$, if $attr(u) = A$, $attr(v) = B$, $val(u)$ and $val(v)$ are defined,
and $val(u) = val(v)$, for some nodes $u$ in $\theta(t)$ and $v$ in $\theta(s)$.

$\mathcal{I} \models (t.A \doteq a)\theta$, if $attr(v) = A$, and either $val(v)$ is defined, $a$ is a domain
value, and $val(v) = a$, or $val(v)$ is undefined and $a$ is $\bot$, for some node $v$ in
$\theta(t)$.

$\mathcal{I} \models (t.A \doteq s.B)\theta$, if $attr(u) = A$, $attr(v) = B$, and either $val(u)$ and $val(v)$
are defined, and $val(u) = val(v)$, or $val(u)$ and $val(v)$ are both undefined,
for some nodes $u$ in $\theta(t)$ and $v$ in $\theta(s)$.

$\mathcal{I} \models (t[a])\theta$, if $val(v) = a$, for some node $v$ in $\theta(t)$.

$\mathcal{I} \models (t \sqsubseteq s)\theta$, if $\theta(t) \le_{sub} \theta(s)$.

⁵ For expressive power issues, following standard practice, we focus attention on the
fragment of CC without the atoms $t.A = a$ and $t.A \doteq a$.
X \= (t s)6, if r <sub where r is a < sub-maximal component without

any occurences of attribute A, such that r<sub 

$\mathcal{I} \models T(t)\theta$, if $\theta(t) =_{sub} C$ for some component $C$ of $T$.

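The semantics of $\wedge$ and $\neg$ is as usual, and for quantification we need to say that
$\theta(t/\tau)$ is the valuation $\theta$ except that it maps the variable $t$ to the line tree $\tau$ in
the universe of $(\mathcal{I}, \varphi)$. Then

$\mathcal{I} \models (\exists t\, \varphi)\theta$, if $\mathcal{I} \models \varphi\,\theta(t/\tau)$, for some $\tau$ in the universe of $(\mathcal{I}, \varphi)$.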
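The difference between the two kinds of equality in the attribute-to-attribute atoms can be illustrated on components flattened to lists of attribute-value pairs. This is only a sketch: the flattened representation, `None` for an undefined value, and the function names are assumptions made here, and the tree structure of components is ignored.

```python
from typing import List, Optional, Tuple

Component = List[Tuple[str, Optional[str]]]  # (attribute, value or None)


def sat_strict(t: Component, A: str, s: Component, B: str) -> bool:
    """Strict equality: some A-node of t and B-node of s carry defined, equal values."""
    return any(
        a == A and b == B and x is not None and y is not None and x == y
        for (a, x) in t
        for (b, y) in s
    )


def sat_null_tolerant(t: Component, A: str, s: Component, B: str) -> bool:
    """Second equality: values are defined and equal, or both are undefined."""
    return any(
        a == A and b == B and x == y  # None == None covers the both-undefined case
        for (a, x) in t
        for (b, y) in s
    )


t = [("A", "a"), ("C", None)]
s = [("B", "a"), ("D", None)]
```

On this data, the pair of null-valued nodes for `C` and `D` is matched only by the null-tolerant equality, while the pair for `A` and `B` is matched by both.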

It will be convenient to work with finite subuniverses of a pair $(T, \varphi)$.
Let $k$ be a natural number. Then the $k$-universe of $(T, \varphi)$ is the subuniverse of
$(T, \varphi)$ consisting of line trees with at most $k$ nodes. The definition of the
$k$-universe of the pair $(\mathcal{I}, \varphi)$ for an instance $\mathcal{I}$ is similar.

A component calculus query expression $E$ is an expression of the form $\{t \mid \varphi(t)\}$,
where $\varphi(t)$ is a component calculus formula with $t$ as its only free variable. Let
the length (in number of symbols) of $E$ be $k$. When applied to an instance
$\mathcal{I}$, $E$ defines a tree $E(\mathcal{I})$ with a “new” node $v$ as its root, where $attr(v) = \bot$,
$val(v) = \bot$, and $label(v) = \{\obar\}$. The subtrees of the root are determined as
follows: Let $\theta_1, \ldots, \theta_q$ be all assignments into the $k$-universe of $\mathcal{I}$ such that for
$i \in [1, q]$, $\mathcal{I} \models \varphi(t)\theta_i$. Now take the forest $\{\theta_i(t)\}_{i=1}^{q}$ and reduce this forest
under $\le_{sub}$. If there are several equivalent $\le_{sub}$-minimal line trees, choose one
representative of each equivalence class, say the line tree that has the smaller
attributes (in some standard enumeration of the attributes) higher up. This will
yield a subsumption equivalent forest $\{T_1, \ldots, T_p\}$, $p \le q$. The root then has
subtrees $T_1, \ldots, T_p$. Sample queries can be found in Section 3.3.

3.2 The Component Algebra 

In this section, we define an algebra called the component algebra (CA), whose 
expressions map SSDBs to database trees. This is achieved by extending the 
classical relational operators to the SSDB setting. We shall find it convenient 
to have the notion of a canonical form of an instance. First we recall that a null
node is a node $u$ where both $attr(u)$ and $val(u)$ equal $\bot$. A database instance
$\mathcal{I}$ is in canonical form if every node in $\mathcal{I}$ that has a blocker is a null node. It
turns out that every instance $\mathcal{I}$ can be easily converted into an “equivalent”
instance that is in canonical form, as follows. (1) For every non-null node $u$ with
attribute-value pair $A : a$⁶ and a horizontal blocker, replace $u$ with three nodes
$x, y, z$ such that: (i) $x$ and $z$ both have the attribute-value pair $A : a$ associated
with them, while $y$ is a null node; (ii) the parent of $u$ becomes the parent of $x$,
while every child of $u$ becomes a child of $z$; (iii) $y$ gets a horizontal blocker,
is the only child of $x$, and is the parent of $z$; and (iv) if $u$ additionally had a
vertical blocker, then $z$ gets a vertical blocker. (2) For every non-null node $u$ with
attribute-value pair $A : a$ and a vertical blocker, let $v_1, \ldots, v_n$ be the children of $u$.
Then replace $u$ with $n + 1$ nodes $r, x_1, \ldots, x_n$ such that: (i) $x_i$ is a child of $r$, has


⁶ $a$ can be $\bot$.
attribute-value pair $A : a$, and is the parent of $v_i$, $i = 1, \ldots, n$; (ii) $r$ is a null
node, has a vertical blocker, and is a child of $u$’s parent. It is straightforward
to verify that the instance so constructed is partially isomorphic to $\mathcal{I}$. In the
sequel, we shall assume without loss of generality that the instances we consider
are in canonical form.

We begin with the simplest operator, union. Let $T_1$ and $T_2$ be any database
trees (which may be the result of applying CA expressions to a database). Let $T_i'$
denote a copy of $T_i$, $i = 1, 2$. Then $T_1 \cup T_2$ is the tree $T$ with a new null node $u$
as its root, with $label(u) = \{\obar\}$, which has the roots of $T_1'$ and $T_2'$ as its children,
in that order. Note that the operands are not required to be union-compatible,
as such an assumption would be inappropriate for the SSDB setting.

Each of our remaining operators can be conceptually regarded as involving
two steps.⁷ First, one marks a set of nodes and erases the attribute/value pairs
associated with them. Second, one makes the resulting tree compact by contracting
“redundant” nodes. For conceptual clarity, we thus define these auxiliary
“operations” first. Let $T = (V, E, attr, val, label)$ be a database tree and $N$
a subset of $T$’s nodes. Then erase($N$, $T$) is obtained by taking an identical
copy of $T$ and erasing, for each node $u \in N$, the attribute/value pair associated
with $u$. More precisely, let $T' = (V', E', attr', val', label')$ be an identical
copy of $T$, $h : T \to T'$ be a 1-1 onto special tree embedding from $T$ to $T'$, and
let $h(N) = \{h(u) \mid u \in N\}$. Then erase($N$, $T$) is the database tree
$T'' = (V', E', attr'', val'', label')$, such that $\forall u \in h(N)$, $attr''(u) = val''(u) = \bot$, and
$\forall u \in V' - h(N)$, $attr''(u) = attr'(u)$ and $val''(u) = val'(u)$. In the sequel, we
will abuse terminology and say erase a node to mean erase the attribute/value
pair associated with it.
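The erase step is simple enough to sketch directly. In the version below a tree is a plain dictionary with `edges`, `attr`, `val` and `label` entries, and `None` is the null; as a simplification that departs from the formal definition, the identical copy reuses the original node identifiers instead of introducing new nodes related by a special embedding.

```python
import copy
from typing import Any, Dict, Iterable


def erase(nodes_to_erase: Iterable[int], tree: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `tree` in which the given nodes have attr = val = null."""
    result = copy.deepcopy(tree)  # the input tree is left unchanged
    for u in nodes_to_erase:
        result["attr"][u] = None
        result["val"][u] = None
    return result  # edges and labels are carried over as they are


tree = {
    "edges": [(0, 1), (0, 2)],
    "attr": {0: None, 1: "A", 2: "B"},
    "val": {0: None, 1: "a", 2: "b"},
    "label": {0: {"vertical"}},  # hypothetical blocker encoding
}
erased = erase({2}, tree)
```

Erasing never removes nodes or edges, which is why a separate compaction step is needed afterwards.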

The second auxiliary operator is reduce. Applying it to a database tree $T$
results in a tree which has no more nodes than $T$, but which is content equivalent
to it. Intuitively, whenever $T$ contains a parent node $u$ and a null child
node $v$, both belonging to a common component, then $u$ and $v$ can be identified
(i.e. merged). This will leave the contents unchanged as long as the child (null)
node does not have any associated horizontal blocker.⁸ Formally, reduce($T$) is
obtained by repeatedly applying the following step to an identical copy $T'$ of $T$,
until there is no change: whenever $T'$ has a null node $u$ whose parent is $v$, and
$u$, $v$ belong to a common component, and $u$ does not have a horizontal blocker,
identify (i.e. merge) $u$ and $v$. Note that this step only applies to null nodes different
from the root, and also that the result of applying reduce to $T$ is always
unique.

Renaming is defined in a straightforward manner. Given a tree $T$, the operator
$\rho_{B \leftarrow A}(T)$ replaces every occurrence of an attribute $A$ (at any node of $T$) by
attribute $B$. When there is no occurrence, this is a no-op.

⁷ Join also involves something more, of course.

⁸ Identifying two adjacent nodes is similar to the edge contraction operation in the
theory of graph minors.
The allowable selection conditions in CA are of one of the following forms:⁹ (i)
$A\,\theta\,a$, (ii) $A\,\theta\,B$, (iii) $A \doteq \bot$, $A \not\doteq \bot$, and (iv) $\_ = a$, where $A$, $B$ are attributes, $a$ is
a value, and $\theta$ is one of $=$, $\neq$, $\doteq$, $\not\doteq$, and their boolean combinations. We include
conditions of form (iv) since in a SSDB where a predefined schema may not
exist, it is often convenient to search for occurrence of values without knowing a
priori their associated attributes.¹⁰ Let $T$ be a database tree and $\mathcal{C}$ any allowable
selection condition of the form, say, $A\,\theta\,B$. Then a component $C$ of $T$ satisfies this
selection condition exactly when $C$ contains two nodes $u$, $v$ such that $attr(u) = A$,
$attr(v) = B$, and $val(u)$ and $val(v)$ stand in the relationship indicated, as
per the definition of satisfaction given for CC. Satisfaction of other conditions
is similar. In particular, $C$ satisfies $\_ = a$ provided $C$ contains some node with
associated value $a$. Our definition of selection follows. Recall that we assume
database trees are in canonical form.

Definition 5. Let $T$ be a database tree, and $\mathcal{C}$ any allowable selection condition.
Then

$\sigma_{\mathcal{C}}(T)$ = reduce(erase($\{u \in T \mid u$ belongs to a component $C$ of $T$ and
$C$ does not satisfy $\mathcal{C}\}$, $T$)).

Intuitively, selection erases the attribute/value information on all nodes belonging
to those components of $T$ which do not satisfy $\mathcal{C}$ and reduces the resulting
tree.

Definition 6. Let $T$ be a database tree, $X$ any set of attributes. Then

$\pi_X(T)$ = reduce(erase($\{u \in T \mid attr(u) \notin X\}$, $T$)).

Projection on a set of attributes X erases all nodes whose attribute is not 
in X, and reduces the resulting tree. When we work with liberal schema, we will 
find it convenient to use dual projection, which amounts to dropping attributes. 

Definition 7. Let $T$ be a database tree, $A$ any attribute. Then

$\overline{\pi}_A(T)$ = reduce(erase($\{u \in T \mid attr(u) = A\}$, $T$)).

This operator erases all nodes with associated attribute A (if any) and reduces 
the resulting tree. 

The minus operator is based on the notion of subsumption. A component $C$
of a tree $T$ is subsumed by a component $C'$ of a tree $T'$ exactly when $C$ as a
tree is subsumed by $C'$ (cf. Section 2.2).

Definition 8. Let $T_1$, $T_2$ be any database trees. Then

$T_1 - T_2$ = reduce(erase($\{u \in C \mid C$ is a component of $T_1$ and
$\exists$ a component $C'$ of $T_2 : C \le_{sub} C'\}$, $T_1$)).

⁹ As mentioned for CC, when considering expressive power issues, we will drop form (i).
¹⁰ This feature is exhibited by previous SSDB languages.
Basically, $T_1 - T_2$ is obtained by erasing (the nodes in) those components
of $T_1$ which are subsumed by some component of $T_2$ and then reducing the
resulting tree. The last operator is join. Since we are operating in a model where
we use attribute names (as opposed to position numbers as in the classical relational
model), it is more appropriate to define join as a primitive operator. Thereto,
we need the following notions.

The preorder precedes relation from Section 2 is extended to components of trees:
a component $C$ of $T$ preorder precedes another component $C'$ of $T$ provided for
every non-null node $v$ in $C'$, there is a non-null node $u$ in $C$ such that $u$ preorder
precedes $v$. Recall that components of trees in canonical form never share non-null
nodes. Consequently, on such trees, the preorder precedes relation is a total
order on components.

Let $T$ and $T'$ be any trees. Then we can form a new tree $T \odot T'$ by taking
a new node $v$ as root, with $attr(v) = \bot$, $val(v) = \bot$, and $label(v) = \{\}$, and by
making copies of $T$ and $T'$ subtrees of $v$, in that order.

For an enumeration of trees $T_1, \ldots, T_n$, we abuse notation and let
$\obar(T_1, \ldots, T_n)$ denote the tree with a new null node $u$ as its root with label
$\obar$, and the roots of copies of $T_1, \ldots, T_n$ as its children, in that order.

Let $C$ be a component of a tree $T$ and let $X$ be a set of attributes appearing
in $C$. Suppose $C$ has one or more nodes corresponding to each of the attributes
in $X$. A cutset of $C$ w.r.t. $X$ is any set of nodes $V_X$ in $C$ such that there is
exactly one node in $V_X$ corresponding to each attribute in $X$. Clearly, a cutset
need not be unique. Each cutset $V_X$ leads to a component which is multiple-free
w.r.t. $X$, as follows. Let $\overline{V_X} = \{u \mid u \in nodes(C) - V_X \wedge attr(u) \in X\}$,
i.e. $\overline{V_X}$ consists of precisely those nodes in $C$ not in $V_X$ which contain an
attribute in $X$. Then $reduce(erase(\overline{V_X}, C))$ denotes the component obtained from $C$ by
erasing all nodes outside $V_X$ which contain an attribute in $X$, and then reducing
the resulting component. Finally, the decomposition of $C$ w.r.t. $X$ is defined
as $decomp(C, X) = \{reduce(erase(\overline{V_X}, C)) \mid V_X \text{ is a cutset of } C \text{ w.r.t. } X\}$.
Clearly, when $C$ is multiple-free w.r.t. $X$, $decomp(C, X) = \{C\}$.
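
The cutset construction can be illustrated with a small Python sketch. It is only an approximation under stated assumptions: a component is flattened to a list of (node id, attribute, value) triples, the tree shape and the reduce step are ignored, and the names `cutsets` and `decomp` are chosen here for illustration. The Ontario values are taken from the answer to Query 3 in Section 3.3.

```python
from itertools import product

def cutsets(component, attrs):
    """Yield every cutset: exactly one node per attribute in attrs."""
    # Assumes each attribute in attrs labels at least one node.
    candidates = [[n for n in component if n[1] == a] for a in sorted(attrs)]
    for choice in product(*candidates):
        yield set(choice)

def decomp(component, attrs):
    """One multiple-free piece per cutset (tree structure is not modelled)."""
    pieces = []
    for vx in cutsets(component, attrs):
        # Erase nodes that carry an attribute in attrs but are outside the cutset.
        pieces.append([n for n in component if n[1] not in attrs or n in vx])
    return pieces

# Nodes are (node_id, attribute, value) triples.
ontario = [(1, "PROV", "Ontario"), (2, "CAP", "Toronto"),
           (3, "RES", "Oil"), (4, "RES", "Asbestos"), (5, "POP", 5)]
pieces = decomp(ontario, {"RES"})
```

Here the component has two RES nodes, so the sketch produces two pieces, one per resource; a component that is already multiple-free yields itself as the only piece.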

Let $C$, $C'$ be any components of trees $T_1$, $T_2$. Suppose $C$ (resp., $C'$) is multiple-free
w.r.t. $A$ (resp., $B$). Then the pair $(C, C')$ satisfies the condition $A = B$ provided
there is a pair of nodes $u \in C$ and $v \in C'$ whose associated attributes are $A$
and $B$ respectively, whose associated values are defined, and are equal. Satisfaction for
$A \neq B$, $A \doteq B$, $A \not\doteq B$ is similarly defined. In particular, the pair $(C, C')$ satisfies
$A \doteq B$ provided there is a pair of nodes $u \in C$ and $v \in C'$ whose associated
attributes are $A$ and $B$ respectively, and either both their associated values are
undefined, or both are defined and equal. Whenever $(C, C')$ satisfies $A \theta B$, we
say $(C, C')$ is (theta-)joinable. When $C$ and/or $C'$ is not multiple-free w.r.t. the
appropriate attribute, we work with the components in their decomposition. Let
$decomp(C, A) = \{C_1, \dots, C_k\}$, where $C_i$ preorder precedes $C_{i+1}$, $1 \le i < k$,
and $decomp(C', B) = \{C'_1, \dots, C'_m\}$, where $C'_j$ preorder precedes $C'_{j+1}$,
$1 \le j < m$. Then define $C \star_{A\theta B} C' = \odot(C_i \odot C'_j \mid C_i \text{ and } C'_j \text{ are joinable})$. Here,
the pairs of components are enumerated in a manner that respects the preorder
sequence.

Let $C$ be a component of $T_1$ and let $T_2$ be any tree. Then
$C \star_{A\theta B} T_2 = \odot(C \star_{A\theta B} D_1, \dots, C \star_{A\theta B} D_n)$, where $D_1, \dots, D_n$ are all the
components of $T_2$ and $D_i$ preorder precedes $D_{i+1}$, $1 \le i < n$. For a tree $T$ and a
binary relation $r = \{(C, D) \mid C \text{ is a component of } T, D \text{ is any tree}\}$, let $repl(T, r)$ denote the
result of replacing, in a copy of $T$, (the corresponding copy of) each component $C$
of $T$ by the tree $D$, whenever $r(C, D)$ holds.

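Definition 9. Let $T_1$ and $T_2$ be any database trees, and let $A \in \mathcal{A}(T_1)$, $B \in \mathcal{A}(T_2)$
be any attributes appearing in them. Then $T_1 \bowtie_{A\theta B} T_2$, where $\theta$ is
one of $=, \neq, \doteq, \not\doteq$, is defined as
$T_1 \bowtie_{A\theta B} T_2 = repl(T_1, \{(C, C \star_{A\theta B} T_2) \mid C \text{ is a component of } T_1\})$.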

In words, join works as follows. Create a copy $T'_1$ of $T_1$. For each component $C$
of $T'_1$ do the following: (i) concatenate (in the sense of $\odot$) $C$ with each
component $C'$ of $T_2$ with which it is joinable; (ii) then “merge” the resulting set of
trees using the n-ary $\odot$ operator above; this gives rise to the tree $C \star_{A\theta B} T_2$ above;
(iii) next, replace $C$ by the tree $C \star_{A\theta B} T_2$ in $T'_1$. A technicality arises due to the fact
that $C$ (resp., $C'$) may not be multiple-free w.r.t. attribute $A$ (resp., $B$), which
is handled by decomposing $C$ ($C'$) w.r.t. $A$ ($B$) and “processing” the resulting
pieces w.r.t. their preorder enumeration.
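
The three steps can be sketched in Python under strong simplifying assumptions: each component is already multiple-free and is flattened to a dictionary from attribute to value (with `None` standing for an undefined value), a tree is a list of components in preorder, and the merged tree is represented as a plain list. The function names and the condition names `"eq"` and `"weak_eq"` are hypothetical; only the two conditions spelled out in the text are covered.

```python
def satisfies(c, d, a, b, theta="eq"):
    """Check the condition a theta b for components c and d."""
    if a not in c or b not in d:
        return False
    u, v = c[a], d[b]
    both_defined = u is not None and v is not None
    if theta == "eq":        # both values defined and equal
        return both_defined and u == v
    if theta == "weak_eq":   # both undefined, or both defined and equal
        return (u is None and v is None) or (both_defined and u == v)
    raise ValueError("unsupported condition: " + theta)

def theta_join(t1, t2, a, b, theta="eq"):
    """Replace each component of t1 by its concatenations with joinable components of t2."""
    result = []
    for c in t1:                               # step (iii): one replacement per component
        merged = [{**c, **d} for d in t2       # step (i): concatenate joinable pairs
                  if satisfies(c, d, a, b, theta)]
        result.append(merged)                  # step (ii): merged in preorder
    return result

t1 = [{"PROV": "Quebec", "CAP": "QuebecCity"}, {"PROV": "NovaScotia"}]
t2 = [{"CITY": "Montreal", "IND": "Aerospace"},
      {"CITY": "QuebecCity", "IND": "Tourism"}]
joined = theta_join(t1, t2, "CAP", "CITY")
```

In this sketch a component with no joinable partner is replaced by an empty list, mirroring a merge over an empty enumeration.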

We note that the definition of natural join is analogous. The main issue,
however, is when we can consider a pair of components natural joinable. There
are several ways to define this, depending on which notion of equality is adopted.
We argue that when no notion of predefined schema is applicable (as is common
with most SSDB applications), the definition most likely to be useful in practice is
one that says whenever both components have defined values for any common
attribute, then they must agree. We note that this definition of natural join
is in the spirit of the polymorphic join defined by Buneman and Ohori [12].
Formally, we say that a pair $(C, C')$ of components of $T_1$, $T_2$ respectively, is
natural joinable, provided for every attribute $A$, if there are nodes $u \in C$ and
$v \in C'$ such that their associated attribute is $A$ and their associated values are
both defined, then they are equal. Define the notions $C \star C'$, $C \star T_2$ analogously to
$C \star_{A\theta B} C'$ and $C \star_{A\theta B} T_2$, but incorporating the condition for natural joinability
instead of theta-joinability. Necessary decompositions must be done w.r.t. the
set of common defined attributes for the pair of components being joined. We
have the following definition of natural join.

Definition 10. Let $T_1$ and $T_2$ be any database trees. Then

$T_1 \bowtie T_2 = repl(T_1, \{(C, C \star T_2) \mid C \text{ is a component of } T_1\})$.

Sample algebraic queries can be found in the next section.
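
The natural joinability condition can be made concrete with a short Python sketch. As before, this is a simplification and not the algebra itself: components are assumed multiple-free and are flattened to dictionaries, `None` stands for an undefined value, and the helper names are invented for illustration. The second component of `t2` is made-up sample data.

```python
def natural_joinable(c, d):
    """Common attributes must agree whenever both values are defined."""
    return all(c[a] == d[a]
               for a in c.keys() & d.keys()
               if c[a] is not None and d[a] is not None)

def merge(c, d):
    """Concatenate two components, keeping a defined value when one side is undefined."""
    out = dict(c)
    for a, v in d.items():
        if out.get(a) is None:
            out[a] = v
    return out

def natural_join(t1, t2):
    """Replace each component of t1 by its merges with natural joinable components of t2."""
    return [[merge(c, d) for d in t2 if natural_joinable(c, d)] for c in t1]

t1 = [{"PROV": "Quebec", "CAP": "QuebecCity", "RES": "Wheat"}]
t2 = [{"PROV": "Quebec", "CITY": "Montreal", "IND": "Aerospace"},
      {"PROV": "Ontario", "CITY": "Toronto"}]  # illustrative data only
joined = natural_join(t1, t2)
```

Note that, under the stated definition, two components with no common defined attribute are vacuously joinable; the sketch inherits that behaviour.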



3.3 Examples 

Here we illustrate the query languages. Consider the following database forest
representing information on Canadian provinces (Figure 4). In tree $T_1$, the
provinces are grouped into two subtrees: the Eastern provinces and the Maritimes.
For each province, the database records its name. PEI stands for Prince
Edward Island. For some provinces the main resources (RES) are recorded, and
likewise for the capital (CAP) and population (POP). Only for Newfoundland
are the main import goods (IMP) recorded. The tree $T_2$ shows major cities and
industries in each province.

Fig. 4. A Canadiana Forest

Consider now the following queries. 

Query 1 Suppose we want the name and capital of all recorded provinces in
tree $T_1$. Note that the user does not necessarily have any knowledge about the
structure of the database. This query is expressed by

$\{t \mid \exists s : T_1(s) \wedge t.\mathrm{PROV} = s.\mathrm{PROV} \wedge t.\mathrm{CAP} \doteq s.\mathrm{CAP}\}$.

This CC expression will return (a tree with) the following set of components
(as branches): {(PROV : Ontario, CAP : Toronto), (PROV : Quebec, CAP : QuebecCity),
(PROV : PEI, CAP : Charlottetown), (PROV : NovaScotia), (PROV : NewFoundland)}. In
component algebra this query is formulated as

$\pi_{\mathrm{PROV},\mathrm{CAP}}(\sigma_{\mathrm{PROV} \neq \bot}(T_1))$.

Note that the Component Algebra query would return a “substructure” of $T_1$,
induced by the PROV and CAP nodes (if any), as the answer.

Query 2 List those provinces in $T_1$ that are in any way associated with wheat,
and their capitals.

$\{t \mid \exists s : T_1(s) \wedge t.\mathrm{PROV} = s.\mathrm{PROV} \wedge t.\mathrm{CAP} \doteq s.\mathrm{CAP} \wedge s[\mathrm{wheat}]\}$.

The answer will include the following components: {(PROV : Quebec,
CAP : QuebecCity), (PROV : NewFoundland)}. In component algebra, this query can
be expressed as

$\pi_{\mathrm{PROV},\mathrm{CAP}}(\sigma_{\mathrm{PROV} \neq \bot,\ \_ = \mathrm{wheat}}(T_1))$.

Query 3 List all information about those provinces in $T_1$ for which the capital
and population are recorded.

$\{t \mid T_1(t) \wedge t[\mathrm{PROV}] \wedge t[\mathrm{CAP}] \wedge t[\mathrm{POP}]\}$.

The answer will be {(PROV : Ontario, CAP : Toronto, RES : Oil, POP : 5),
(PROV : Ontario, CAP : Toronto, RES : Asbestos, POP : 5), (PROV : PEI,
CAP : Charlottetown, POP : 0.1)}. In algebra this would be written as

$\sigma_{\mathrm{PROV} \neq \bot \wedge \mathrm{CAP} \neq \bot \wedge \mathrm{POP} \neq \bot}(T_1)$.

Query 4 Join the information in $T_1$ and $T_2$.

$\{t \mid \exists u, v : T_1(u) \wedge T_2(v) \wedge u \sqsubseteq t \wedge v \sqsubseteq t\}$.

The answer will contain the components {(UNIT : QC, PROV : Quebec,
CAP : QuebecCity, RES : Asbestos, CITY : Montreal, IND : Aerospace),
(UNIT : QC, PROV : Quebec, CAP : QuebecCity, RES : Wheat, CITY : Montreal,
IND : Aerospace), (UNIT : QC, PROV : Quebec, CAP : QuebecCity, RES : Asbestos,
CITY : QuebecCity, IND : Tourism), (UNIT : QC, PROV : Quebec, CAP : QuebecCity,
RES : Wheat, CITY : QuebecCity, IND : Tourism)}. In Component Algebra we would
simply write

$T_1 \bowtie T_2$.

The result of the algebraic join is shown in Figure 5.

Fig. 5. The result of a natural join (tree $T_1 \bowtie T_2$)



4 On the Expressive Power of Component Calculus and
Algebra

4.1 Calculus vs. Algebra 

In this section, we study the expressive power of Component Calculus/Algebra.
First, we establish that while CC is representation independent but not repre- 
sentation preserving, CA is both. Next we show that CC is instance complete. 
Instance completeness is a notion that generalizes the BP-completeness of Ban- 
cilhon and Paredaens to semi-structured databases. The natural next question is 
whether Component Calculus collapses to Relational Calculus when inputs and 
outputs are relational. We show that when we consider instances of a predefined 
(i.e. conservative) schema A, then CC and Tuple Relational Calculus (TRC) 
indeed have the same expressive power. However, there are queries expressible 
in CC, such that no equivalent TRC expression has size polynomial in the size 
of A. In the case where instances can have arbitrary (i.e. liberal) schemas, we 
show that there are queries expressible in CC that are not expressible in TRC. 

Theorem 2. (1) The component calculus is representation independent, but 
not representation preserving. (2) The component algebra is both representation 
independent and representation preserving. 

This phenomenon is due to the declarative/procedural dichotomy between
calculus and algebra. The Component Calculus “talks” about (connected)
components of the input databases. It does not distinguish between content
equivalent representations of the same components. Therefore the calculus is
representation independent. On the other hand, the calculus always returns components
in a canonical form (the minimal line tree). The calculus is thus not
representation preserving. The component algebra operates directly on the tree underlying
the contents of the database. The algebra is designed to upset the existing
structure as little as possible. On the other hand, the algebra is not syntax-dependent,
in the sense that any query will produce content equivalent outputs on content
equivalent inputs.

Notice that the semantics of CC is based on the extended active domain 
semantics. Thus, the issue of safety does not arise. We can show the equivalence 
of CC and CA. 

Theorem 3. The component calculus and algebra are equivalent in expressive
power. That is, for every expression $E_c$ of the component calculus, there
is an expression $E_a$ in the component algebra, such that for every instance
$\mathcal{I}$, $E_c(\mathcal{I}) =_{cont} E_a(\mathcal{I})$, and vice versa.

It is important to notice that calculus and algebra expressions will not pro- 
duce identical results, although the results will always be content equivalent. 



4.2 Instance Completeness 

In this section, we extend the classical notion of BP-completeness due to [8,27] to
SSDBs. Recall that in the classical case, a language $\mathcal{L}$ is BP-complete provided
for every instance $\mathcal{I}$, for every query $Q$ (that is computable and is invariant under
automorphisms), there is an expression $E$ in $\mathcal{L}$ such that $E(\mathcal{I}) = Q(\mathcal{I})$. In the
case of SSDBs, because of structural variations among equivalent representations
of the same information, we modify this notion as follows. In order to avoid
confusion, we term the extended notion instance completeness, rather than
BP-completeness. A query language $\mathcal{L}$ for SSDBs is instance complete provided for
every instance $\mathcal{I}$, for every query $Q$, there is an expression $E$ in $\mathcal{L}$ such that
$E(\mathcal{I}) =_{cont} Q(\mathcal{I})$. We have the following results.

Theorem 4. The component calculus and component algebra both compute 
only queries. 

Theorem 5. The component calculus is instance complete. 

Corollary 1. The component algebra is instance complete. 

Our proof is based on an extension of the technique developed by Paredaens. 
One of the novelties is finding a tree representation for the auto-embedding group 
of an instance [19]. 

4.3 Component vs. Relational Calculus 

A semi-structured database has its schema implicitly defined in the extension.
Thus the schema is not known before processing the instance. The Component
Calculus is designed so that it can deal with the situation in an instance-independent
manner. All we know is that the attribute names come from a countable set
$\mathcal{A}$. Thus, the relevant notion of schema in this case is that of a liberal schema.
On the other hand, if we have a predetermined collection of semi-structured
databases, the set of all attributes occurring in the collection is finite, and one
can apply the notion of the so-called conservative schema.

To understand how CC compares with tuple relational calculus (TRC) in
terms of expressive power, we need to extend the latter to handle inapplicable
nulls [24]. This we do by introducing the equality $\doteq$ and adding the symbol
$\bot$ to TRC’s vocabulary. Specifically, the extended language, TRC$^{\bot}$, consists of
all atoms of TRC, together with $t.A \doteq s.B$ and $t.A \doteq \bot$, where $t$, $s$ are tuple
variables and $A$, $B$ are attributes. Their semantics is exactly that in CC.

Consider a set of attributes $\mathbf{A} = \{A_1, \dots, A_n\}$, and relations over $\mathbf{A}$, possibly
containing inapplicable nulls. Define a query $Q_{rel}$ over such relations $r$ as follows.

$$Q_{rel}(r) = \begin{cases}
\{()\} & \text{if } (\exists t \in r : t.A_1 \neq \bot \wedge t.A_n \neq \bot) \text{ or} \\
       & \exists t_1, t_2 \in r : t_1[A_1] \wedge t_2[A_n] \wedge
         \forall A \in \mathbf{A} : [t_1.A \neq \bot \wedge t_2.A \neq \bot \Rightarrow t_1.A = t_2.A], \\
\{\} & \text{otherwise.}
\end{cases}$$
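
To make the two branches of the query concrete, the following Python sketch evaluates it on a relation given as a list of dictionaries. The representation is an assumption made here for illustration: a missing key or a `None` value plays the role of an inapplicable null, `attrs` lists the schema in order so that its first and last entries stand for $A_1$ and $A_n$, and the function name `q_rel` is hypothetical.

```python
def q_rel(r, attrs):
    """Return {()} if the query condition holds on relation r, else the empty set."""
    a1, an = attrs[0], attrs[-1]

    def defined(t, a):
        return t.get(a) is not None

    # First disjunct: a single tuple with both A1 and An defined.
    if any(defined(t, a1) and defined(t, an) for t in r):
        return {()}
    # Second disjunct: two tuples, one with A1 and one with An defined,
    # that agree on every attribute on which both are defined.
    for t1 in r:
        for t2 in r:
            if defined(t1, a1) and defined(t2, an) and all(
                    t1[a] == t2[a] for a in attrs
                    if defined(t1, a) and defined(t2, a)):
                return {()}
    return set()

schema = ["A1", "A2", "A3"]
compatible = [{"A1": 1, "A2": 2}, {"A2": 2, "A3": 3}]
conflicting = [{"A1": 1, "A2": 2}, {"A2": 5, "A3": 3}]
```

The agreement check ranges over every attribute of the schema, which gives some intuition for why an expression that has to enumerate the attributes explicitly grows with the schema.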

By adopting the technique for simulating relational databases as SSDBs (e.g.,
see [11]) with a $\ominus$ placed at the root node, we can see that $Q_{rel}$ is a query over
SSDBs in the sense of Definition 2. For simplicity, below we blur the distinction
between a relational database and its SSDB simulation. We have the following
results.

Theorem 6. There is an expression in the component calculus, with size linear
in the number of attributes in $\mathbf{A}$, that computes $Q_{rel}$.

Theorem 7. (1) There is an expression in TRC$^{\bot}$ that computes $Q_{rel}$. (2)
However, there is no expression $E$ in TRC$^{\bot}$, with size polynomial in the number
of attributes in $\mathbf{A}$, such that $E$ computes $Q_{rel}$.

Recall that $\mathcal{A}$ denotes the countable set of all possible attributes. In most
“real-life” SS applications, the schema of the potential input databases is not
known, just as the number of tuples in a relation in the classical model is not
known in advance. In this case, the notion of liberal schema, i.e. an arbitrary
finite subset of $\mathcal{A}$, not known a priori, applies. We extend the above query $Q_{rel}$
to this situation as follows. Let $T$ be a relation over an arbitrary finite subset
$\mathbf{A} \subset \mathcal{A}$, which is not known a priori. The definition of the extended query is exactly
the same as that of $Q_{rel}$ for the conservative schema.

The CC query expression

$\{t \mid (\exists t : T(t) \wedge t[A_1] \wedge t[A_n]) \vee
(\exists t, s_1, s_2 : T(s_1) \wedge T(s_2) \wedge s_1[A_1] \wedge s_2[A_n] \wedge s_1 \sqsubseteq t \wedge s_2 \sqsubseteq t)\}$

clearly computes the extended query. Indeed, the same formula computes $Q_{rel}$ when the schema
is conservative. We have

Theorem 8. The extended query is expressible in the component calculus. However,
there is no expression in TRC$^{\bot}$ which can compute this query.

A restricted version of this result has been independently discovered recently
by Van den Bussche and Waller [15].



5 Where Do the Labels Come From?

Our data model assumes that the trees are appropriately labeled with blockers.
How does the system obtain these blockers? There are two answers to this
question. First, the data may be “raw data” obtained from the web. In this case,
schema mining techniques similar to [26] can be used. We do not explore this
approach further in this context. Second, the trees might be the result of
exporting structured or semi-structured data into a common exchange format. We
contend that our enhanced OEM model is well suited for this purpose. To
substantiate this claim, we briefly sketch how relations, nested relations, and XML
data can be represented in our enhanced OEM model in a manner that makes the
fact structure explicit.

Relational data. A tuple $[A : a, B : b, C : c]$ can be modelled as any
connected component with three nodes, having attribute-value pairs $(A : a)$,
$(B : b)$, and $(C : c)$, respectively. A relation $R = \{t_1, \dots, t_n\}$ is modelled as
a tree with root with attribute-value pair $(R : \bot)$ and label $\ominus$, and subtrees
corresponding to the components for the tuples $t_1, \dots, t_n$.

Nested Relational data.

Relational tuples and relations are modelled as above. A possibly nested tuple
$[A_1 : v_1, \dots, A_n : v_n]$ is represented as a tree with a null node as root. The root
has children representing each pair $(A_i : v_i)$. If $v_i$ is an atomic value, then the
pair is modelled as above. Otherwise we create a node with value $(A_i : \bot)$ and
a horizontal blocker. If $v_i$ is a tuple, then the created node has one child, which
is the root of the component representing the tuple $v_i$. If $v_i$ is a set, then the
created node also has a vertical blocker, and children representing the elements.

XML data. The structure of XML documents is described through Document
Type Definitions (DTDs) [10]. In Figure 6 we see a simple example of a DTD
specification for a course calendar. The specification states that a calendar
consists of graduate and undergraduate courses. A course entry gives a title and a
schedule for a course, or just the title.

Using the DTD in Figure 6, an XML instance can be parsed into a labeled tree 
in the obvious way. All we need to add is the placement of the blockers 0, 0 in the 
parsed instance. For this we use the following (informal) rules: (1) A node labeled 
with an element name that appears in a list specification (e. g. Gradcourse) is 
labeled with 0. (2) The parent of a node labeled with an element name that 
appears in a star specification (e. g. Gradcourse) is labeled with 0. (3) Leaves 
are not labeled with blockers.

<!DOCTYPE Calendar [
<!ELEMENT Calendar (Gradcourse, Ugradcourse)>
<!ELEMENT Gradcourse (Entry*)>
<!ELEMENT Ugradcourse (Entry*)>
<!ELEMENT Entry (Title, Schedule) | Title>
<!ELEMENT Title CharData>
<!ELEMENT Schedule CharData>
]>

Fig. 6. A DTD Specification

Following the method sketched above, the XML instance in Figure 7 would
result in the database tree in Figure 8. Note that the method is informal. We
are currently investigating automated wrapping of XML data into our enhanced
OEM model.

<CALENDAR> <GRADCOURSE> <ENTRY> Databases. Mo-We-5-9 </ENTRY>
<ENTRY> OLAP </ENTRY>
</GRADCOURSE>
<UGRADCOURSE> <ENTRY> Datastructures </ENTRY>
</UGRADCOURSE>
</CALENDAR>

Fig. 7. An XML instance
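
The informal placement rules can be tried out on the instance of Figure 7 with the following Python sketch. It rests on several assumptions: the instance is parsed with the standard library's `xml.etree.ElementTree`, the two blockers are represented by the placeholder strings `"list_blocker"` (rule 1) and `"star_blocker"` (rule 2), the sets of element names are read off the DTD of Figure 6 by hand (in upper case, to match the instance), and rule 3 is taken to override rule 1 for leaves.

```python
import xml.etree.ElementTree as ET

XML = """<CALENDAR><GRADCOURSE><ENTRY>Databases. Mo-We-5-9</ENTRY>
<ENTRY>OLAP</ENTRY></GRADCOURSE>
<UGRADCOURSE><ENTRY>Datastructures</ENTRY></UGRADCOURSE></CALENDAR>"""

LIST_NAMES = {"GRADCOURSE", "UGRADCOURSE"}  # names appearing in a list specification
STAR_NAMES = {"ENTRY"}                      # names appearing in a star specification

def place_blockers(root):
    """Return (tag, blockers) for every non-leaf node, in document order."""
    labels = []
    for node in root.iter():
        children = list(node)
        if not children:                    # rule 3: leaves get no blockers
            continue
        marks = set()
        if node.tag in LIST_NAMES:          # rule 1
            marks.add("list_blocker")
        if any(ch.tag in STAR_NAMES for ch in children):  # rule 2: mark the parent
            marks.add("star_blocker")
        labels.append((node.tag, marks))
    return labels

labels = place_blockers(ET.fromstring(XML))
```

On this instance the root receives no blocker, while both course nodes receive both placeholders; the ENTRY nodes are leaves and stay unlabeled.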



6 Related Work 

In a recent paper, Abiteboul and Vianu [1] study generic computations over the 
world wide web. They model the web as a simple relational database with links 
as relations, and equipped with certain functional and inclusion dependencies. 
Their notion of genericity is based on the classical database isomorphisms on the 
simple relational database. Abiteboul and Vianu study the expressive power of 
existing languages, such as FO and Datalog. In a related paper, Mendelzon and 
Milo [23] propose essentially an object-based version of the model of Abiteboul 
and Vianu. The notion of genericity is still confined to the classical domain 
permutations. Mendelzon and Milo propose a web-calculus that operates on 
web-objects. Both aforementioned works [1,23] deal only with world wide web 
computations, and their analyses do not spell out the fact structure of SSDBs. 
As far as we know, the notions of representation independence and preservation 
have not been addressed before.

Fig. 8. The Labeled Tree



7 Summary and Future Work 

BP-completeness, originally proposed in [8,27], was a breakthrough which led
to the subsequent development of the notions of genericity and complete query
languages by [16], which in turn play a central role in the theory of classical
database queries. BP-completeness essentially provides an instance-dependent
characterization of the expressive power of a query language. The analog
of BP-completeness for object creating languages was developed by [5], which
subsequently led to the development of an appropriate extension of the notion
of genericity as well as a complete query language for the graph-based
object-oriented GOOD data model [14].

In this paper, we have extended BP-completeness for SSDBs. In this connection,
we have introduced the notion of representation independence for SSDB
query languages, which can be regarded as a first important step toward achieving
the repository independence envisioned by Hull [21]. We have also introduced
a notion of genericity and representation preservation for SSDB query languages.
The latter captures the spirit of pure querying as opposed to restructuring. We
have defined a calculus and an equivalent algebra for SSDBs, and shown that
both languages are generic and representation independent, and that the algebra
is also representation preserving, thus lending itself to efficient implementation.
Both languages are instance complete. Finally, the component calculus can
express queries much more concisely than TRC$^{\bot}$, an extension to classical TRC for
handling inapplicable nulls. Moreover, when the schema is not predefined, CC
can express queries that TRC$^{\bot}$ cannot.

It is our hope that the work presented in this paper will be useful in the
development of a comprehensive theory of queries for SSDBs. Clearly, much work
remains to be done. In this paper we have concentrated on SSDBs which are
forests. Handling more general SSDBs is an important problem. In this vein, a
particularly promising direction is to explore a model in the spirit of the
hypertrees of [6] with cross-links between trees, possibly with cycles, rather than
a completely arbitrary digraph. Another interesting direction is to extend the
theory to include restructuring, a feature already exhibited by many of the
languages surveyed in Section 1.

Abstract. Applications have an increasing need to manage semistructured
data (such as data encoded in XML) along with conventional structured
data. We extend the structured object database model ODMG
and its query language OQL with the ability to handle semistructured
data based on the OEM model and Lorel language, and we implement
our extensions in a system called Ozone. In our approach, structured
data may contain entry points to semistructured data, and vice-versa.
The unified representation and querying of such “hybrid” data is the
main contribution of our work. We retain strong typing and access to
all properties of structured portions of the data while allowing flexible
navigation of semistructured data without requiring full knowledge of
structure. Ozone also enhances both ODMG/OQL and OEM/Lorel by
virtue of their combination. For instance, Ozone allows OEM semantics
to be applied to ODMG data, thus supporting semistructured-style
navigation of structured data. Ozone also enables ODMG views of OEM
data, allowing standard ODMG applications to access semistructured
data without losing the benefits of structure. Ozone is implemented on
top of the ODMG-compliant O2 database system, and it fully supports
our extensions to the ODMG model and OQL.



1 Introduction 

Database management systems traditionally have used data models based on
regular structures, such as the relational model [Cod70] or the object model [Cat94].
Meanwhile, the growth of the internet and the recent emergence of XML [LB97]
have motivated research in the area of semistructured data models, e.g., [BDS95,
FFLS97, PGMW95]. Semistructured data models are convenient for representing
irregular, incomplete, or rapidly changing data. In this paper, we extend the
standard well-structured model for object databases, the ODMG model [Cat94],
and its query language, OQL, to integrate semistructured data with structured
data. We present our implementation of the extended ODMG model and query
language in a system called Ozone.

* This work was supported by the Air Force Rome Laboratories under DARPA
Contract F30602-95-C-0119.

R. Connor and A. Mendelzon (Eds.): DBPL’99, LNCS 1949, pp. 297-323, 2000.

We will see that Ozone is well suited to handling hybrid data, that is, data
that is partially structured and partially semistructured. We expect hybrid data
to become more common as more applications import data from the Web,
and the integration of semistructured data within ODMG greatly simplifies
the design of such applications. The exclusive use of a structured data model
for hybrid data would miss the many advantages of a semistructured data
model [Abi97, Bun97]: structured encodings of irregular or evolving
semistructured data are generally complex and difficult to manage and evolve. On the
other hand, exclusive use of a semistructured data model precludes strong typing
and efficient implementation mechanisms for structured portions of the data.
Our approach based on a hybrid data model provides the advantages of both
worlds.

Our extension to the ODMG data model uses the Object Exchange Model
(OEM) [PGMW95] to represent semistructured portions of the data, and it
allows structured and semistructured data to be mixed together freely in the same
physical database. Our OQL^S query language for Ozone is nearly identical to
OQL [Cat94] but extends the semantics of OQL for querying hybrid data. An
interesting feature of our approach is that it also enables structured data to be
treated as semistructured data, if so desired, to allow navigation of structured
data without full structural knowledge. Conversely, it enables structured views
on semistructured data, allowing standard ODMG applications access to
semistructured data. We have implemented the full functionality of Ozone on top of
the O2 [BDK92] ODMG-compliant database management system (a product of
Ardent Software Inc., http://www.ardentsoftware.com).

Related Work 

Data models, query languages, and systems for semistructured data are areas
of active research. Of particular interest and relevance, eXtensible Markup
Language (XML) [LB97] is an emerging standard for Web data, and bears a close
correspondence to semistructured data models introduced in research, e.g., [BDS95,
FFLS97, PGMW95]. An example of a complete database management system
for semistructured data is Lore [MAG+97], a repository for OEM data featuring
the Lorel query language. Another system devoted to semistructured data
is the Strudel Web site management system, which features the StruQL query
language [FFLS97] and a data model similar to OEM. UnQL [BDHS96, BDS95]
is a query language that allows queries on both the content and structure of
a semistructured database and also uses a data model similar to Strudel and
OEM. All of these data models, languages, and systems are dedicated to pure
semistructured data. We know of no previous research that has explored the
integration of structured and semistructured data as exhibited by Ozone. Note
also that our query language OQL^S is supported by a complete implementation
in the Ozone system.

There has been some work in extracting structural information and build- 
ing structural summaries of semistructured databases. For example, [NAM98] 
shows how schema information can be extracted from OEM databases by typ- 
ing semistructured data using Datalog. Structural properties of semistructured 
data can be described and enforced using graph schemas as shown in [BDFS97]. 
Structural summaries called DataGuides are used in the Lore system as descri- 
bed in [GW97]. These lines of research are dedicated to finding the structural 
properties of purely semistructured data and do not address the integration of 
structured and semistructured data as performed by Ozone. 

The OQL-doc query language [ACC+97] is an example of an OQL extension
with a semistructured flavor: it extends OQL to navigate document data
without precise knowledge of its structure. However, OQL-doc still requires some
form of structural specification (such as an XML or SGML Document Type
Definition (DTD) [LB97]), so OQL-doc does not support the querying of arbitrary
semistructured data.

2 Background and Motivating Example 

2.1 The structured ODMG model and OQL 

The ODMG data model is the accepted standard for object databases [Cat94].
ODMG has all the necessary features of object-orientation: classes with at- 
tributes and methods, subtyping, and inheritance. The basic primitives of the 
model are objects (values with unique identifiers) and literals (values without 
identifiers). All values in an ODMG database must have a valid type defined 
in the schema, and all values of an object type or class are members of a col- 
lection known as the extent for that class. Literal types include atomic types 
(e.g., integer, real, string, nil, etc.), structured types with labeled components 
(e.g. tuple(a:integer, b:real)), and collection types (set, bag, list, and array). 

A class type encapsulates some ODMG type, and may define methods that
specify the legal set of operations on objects belonging to the class. A class may
also define relationships with other classes. Classes most commonly encapsulate
structured types, and the different fields in the encapsulated structure (along
with methods without arguments) denote the attributes of the class. ODMG
defines the class Object to be the root of the class hierarchy. An attribute of
a class in the ODMG model may have a literal type and therefore not have
identity. Named objects and literals form entry points into an ODMG database.

The Object Query Language, or OQL, is a declarative query language for
ODMG data. It is an expression-oriented query language: an OQL query is
composed of one or more expressions or subqueries whose types can be inferred
statically. Complex queries can be formed by composing expressions as long as
the compositions respect the type system of ODMG. Details of the ODMG model
and the OQL query language can be found in [Cat94], but are not essential for
understanding this paper.



2.2 The semistructured OEM model and Lorel 

The Object Exchange Model, or OEM, is a self-describing semistructured data
model, useful for representing irregular or dynamically evolving data [PGMW95].
OEM objects may either be atomic, containing atomic literal values (of type
integer, real, string, binary, etc.), or complex, containing a set of labeled OEM
subobjects. A complex OEM object may have any number of children (subobjects),
including multiple children with the same label. Note that all OEM subobjects
have identity, unlike ODMG class attributes. An OEM database may be viewed
as a labeled directed graph, with complex OEM objects as internal nodes and
atomic OEM objects as leaf nodes. Named OEM objects form entry points into
an OEM database.

Lorel is a declarative query language for OEM data and is based on OQL.
Some important features of Lorel are listed below. Details of OEM and Lorel can
be found in [AQM+97], but again are not crucial to understanding this paper.

— Path expressions: Lorel queries navigate OEM databases using path
expressions, which are sequences of labels that may also contain wildcards and
regular expression operators. For instance, the query “Select D From A(.b | .c%)*.d
D” selects all objects reachable from entry-point A by following zero or more
edges each having either label b or a label beginning with the character c,
followed by a single edge labeled d.

— Automatic coercion: Lorel attempts to coerce operands to compatible types
whenever it performs a comparison or other operation on them. For instance,
if X is an atomic OEM object with the string value “4”, then for the evaluation
of X < 10, Lorel coerces X to the integer value 4. If no such coercion is
possible (for instance, if X were an image or a complex object) the predicate
returns false. Lorel also coerces between sets and singleton values whenever
appropriate.

— No type errors: To allow flexible navigation of semistructured data, Lorel
never raises type errors. For instance, an attempted navigation from an OEM
object using a nonexistent label simply produces an empty result, and a
comparison between non-comparable values evaluates to false. Thus, any
Lorel query can be executed on an OEM database with unknown or partially
known structure, without the risk of run-time errors.
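
The three features above can be mimicked in a few lines of Python. This is a rough sketch under stated assumptions, not Lorel's implementation: complex objects are modeled as dictionaries from labels to lists of children, atomic objects as plain Python values, the wildcard % is read as "any run of characters", and coercion is limited to numbers (Lorel's actual coercion rules are richer).

```python
import re


def step(obj, label_regex):
    """Follow one edge whose label fully matches label_regex.
    Atomic objects (non-dicts) and missing labels yield an empty list."""
    if not isinstance(obj, dict):
        return []
    out = []
    for label, children in obj.items():
        if re.fullmatch(label_regex, label):
            out.extend(children)
    return out


def closure(start, label_regex):
    """Objects reachable from start by zero or more matching edges."""
    seen, order, stack = set(), [], [start]
    while stack:
        obj = stack.pop()
        if id(obj) in seen:  # guard against cycles in the graph
            continue
        seen.add(id(obj))
        order.append(obj)
        stack.extend(step(obj, label_regex))
    return order


def select_d(entry):
    """Evaluate the path A(.b | .c%)*.d from the entry point."""
    result = []
    for obj in closure(entry, r"b|c.*"):
        result.extend(step(obj, r"d"))
    return result


def to_number(v):
    if isinstance(v, bool):
        return None
    if isinstance(v, (int, float)):
        return v
    if isinstance(v, str):
        try:
            return float(v)
        except ValueError:
            return None
    return None


def less_than(x, y):
    """Coercing comparison: false, never an error, when coercion fails."""
    a, b = to_number(x), to_number(y)
    return a is not None and b is not None and a < b


# Hypothetical database reachable from entry point A.
A = {
    "b": [{"cx": [{"d": ["hit-1"]}], "d": ["hit-2"]}],
    "z": [{"d": ["miss"]}],
    "d": ["hit-0"],
}
```

Navigation with a nonexistent label and comparison with a non-coercible operand both fall through to an empty or false result instead of raising.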



2.3 Example of hybrid data and queries 

Our motivating example, used throughout the paper, considers a database
behind a simplified on-line broker that sells products on behalf of different
companies. There are three ODMG classes in this database: Catalog, Company, and
Product. Class Catalog has one object, which represents the on-line catalog
maintained by the broker. The object has two attributes: a vendors attribute of type
set(Company), denoting the companies whose products are sold in the catalog,
and a products attribute of type set(Product), denoting the products sold in the
catalog. The Company class defines a one-to-many produces relationship with the
class Product of type list(Product). This relationship specifies the list of
products manufactured by the company, ordered by product number. Likewise, the
Product class defines the inverse many-to-one madeby relationship with the class
Company, denoting the product’s manufacturer. The Company class contains
other attributes such as name and address, and an inventory() method that takes
a product name argument and returns the number of stocked units of the
product of that name. The Product class contains other attributes such as name and
prodnum (product number). The named object Broker of type Catalog provides
an entry point to this database. Figure 1 depicts this schema without atomic
attributes.

In addition to this structured data, let us suppose that we have
product-specific XML information available for some products, e.g., drawn from Web
sites of companies and analyst firms. This data might include manufacturer
specifications (power ratings, weight, etc.), compatibility information if it
applies (for instance, the strobes compatible with a particular camera), a listing of
competing companies and products, etc. To integrate this XML data within our
database, we enhance the Product class with a prodinfo attribute for this
product-specific data. Since this data is likely to vary widely in format, we cannot easily
use a fixed ODMG type for its representation, and it is much more convenient to
use the semistructured OEM data model. Therefore, we let the prodinfo attribute
be a “crossover point” (described below) from ODMG to OEM data.

Fig. 1. Structured ODMG classes in the retail-broker database

There is also a need for referencing structured data from semistructured
data. If a competing product (or company) or a compatible product appears in
the broker’s catalog, then it should be represented by a direct reference to the
ODMG object representing that product or company. If the competing product
or company is not part of the catalog, only then is a complex OEM object created
to encode the XML data for that product or company.




Fig. 2. Example OEM graph for the prodinfo attribute of a Product object 



An example OEM database graph for the prodinfo attribute of a product is
shown in Figure 2. Note that in Figure 2, the competing product named
“System 12” is not part of the catalog database and therefore is represented by a
(complex) OEM object; the other competing product and company are part of
the catalog and are represented by references to Product and Company objects.

To continue with the example, let us also suppose that we have some review
data available in XML for products and companies. The information is available
from Web pages of different review agencies and varies in structure. We enhance
our example database with a second entry point: the named object Reviews
integrates all the XML review data from different agencies. Once again, the diverse
and dynamic nature of this data means that it is better represented by the OEM
data model than by any fixed ODMG type. Thus, Reviews is a complex OEM
object integrating available reviews of companies and products. Here too we may
reference structured data from semistructured data, since reviewed companies
and products that are part of the catalog should be denoted by references to the
ODMG objects representing them.



Fig. 3. Semistructured Reviews data for the broker catalog



Figure 3 is a simplified example of this semistructured Reviews data. We
assume that the reviews by a given agency reside under distinct subobjects of
Reviews, and the names of the review agencies (ConsumersInc, ABC_Consulting,
etc.) form the labels for these subobjects. For subsequent examples, we restrict
ourselves to reviews by ConsumersInc. Reviews by this agency have a subject
subobject denoting the subject of the review (either a product or a company),
which may be a reference to the ODMG object representing the company or
product, or may be a complex OEM object. Both cases are depicted in Figure 3.

Our overall example scenario consists of hybrid data. Some of the data is
structured, such as the Product class without the prodinfo attribute, while some
of the data is semistructured, such as the data reachable via a prodinfo attribute
or via the Reviews entry point.

3 The Extended ODMG Data Model 

Our basic extension to the ODMG data model to accommodate semistructured 
data is therefore relatively straightforward. We extend the ODMG model with 
a new built-in class type OEM. Using this OEM type, we can construct ODMG 
types that include semistructured data. For instance, we can define a partially 
semistructured Product class (as described above) with a prodinfo attribute of 
type OEM. There is no restriction on the use of the type OEM — it can be used 
freely in any ODMG type constructor, e.g., tuple(x:OEM, y:integer) and list(OEM) 
are both valid types. 

Objects in the class OEM are of two categories: OEMcomplex and OEMatomic,
representing complex and atomic OEM objects respectively.^ An OEMcomplex
object encapsulates a collection of (label, value) pairs, where label is a string and
value is an OEM object. The original OEM data model specification included
only unordered collections of subobjects [AQM+97, PGMW95], but XML, for
example, is inherently ordered. Thus we allow complex OEM objects with
either unordered or ordered subobjects in our data model, and refer to them as
OEMcomplexset and OEMcomplexlist respectively.

To allow semistructured references to structured data, the value of an
OEMatomic object may have any valid ODMG type (including OEM). Thus, apart
from the ODMG atomic types integer, real, string, etc., the value of an
OEMatomic object may for example be of type Product, tuple(a:integer, b:OEM),
etc. When the content of an OEMatomic object is of type T, we will say that its
type is OEM(T). Since OEM objects are actually untyped, OEM(T) denotes a
“dynamic type” that does not impose any typing constraint. For example, an
object of type OEM(integer) may be compared with an object of type OEM(string)
or OEM(set(Product)) without raising a type error; further discussion of such
operations is provided in Section 5.3. Intuitively, atomic OEM objects can be
thought of as untyped containers for typed values. Note that an OEMatomic
object of type OEM(OEM) can be used to store a reference to another OEM object
(possibly external to the database). Also note that OEM(nil) is a valid type for an
OEMatomic object, and we assume that there is a single named object OEMNil
of this type in the OEM class.

^ These categories do not represent subclasses since, as we will see, OEM objects are
untyped.

4 Benefits of a Hybrid Approach

We now reinforce the benefits of a hybrid approach. An important advantage
of our approach over a purely structured approach is that we can formulate
queries on semistructured portions of the data without requiring full structural
knowledge. With a purely structured approach, representation of XML data, for
example, would require a different set of ODMG classes for each distinct XML
DTD, possibly leading to complex schemas that the user would be required to
have full knowledge of in order to formulate valid queries. Furthermore,
modifications to the XML data might require expensive schema evolution operations.
In contrast, the OEM model does not rely on a known schema, and the semantics
of Lorel permits formulating queries without full knowledge of structure.

At the same time, an important benefit of our approach over a purely
semistructured approach such as Lore [MAG+97] is that we are capable of exploiting
structure when it is available. In particular, we can more easily take advantage
of known query optimization techniques for structured data, and we can take
advantage of strong typing when portions of the data are typed.

Finally, we can optionally apply the semantics of one data model to the
other, so that the benefits of both models are available to us whether the data
is structured or semistructured. For instance, treating ODMG data as OEM
allows queries to be written without complete knowledge of the schema, while
still retaining access to all ODMG properties (such as methods, indexes, etc.).
On the other hand, we will show in Section 5.2 how our approach enables typed
ODMG views of untyped OEM data, so that standard ODMG applications can
access semistructured data using standard APIs and structural optimizations.

5 The OQL^S Query Language

The query language for the Ozone system is OQL^S. OQL^S is not a new query
language: except for some built-in functions and syntactic conveniences derived
from Lorel, it is syntactically identical to OQL. The semantics of OQL^S on
structured data is identical to OQL on standard ODMG data. OQL^S extends OQL
with additional semantics that allow it to access semistructured data. The
semistructured capabilities of OQL^S are mostly derived from Lorel, which is based on
OQL but was designed specifically for querying pure semistructured data. Like
Lorel, OQL^S allows querying of semistructured data without the possibility of
run-time errors. OQL^S also provides new features necessary for the navigation
of hybrid data: since OQL^S expressions can contain both structured and
semistructured operands, OQL^S defines new semantics that allow such queries to be
interpreted appropriately.

Space limitations preclude a complete specification for OQL^S in this paper.
Since its syntax combines OQL and Lorel, interested readers are referred
to [AQM+97, Cat94]. In the remainder of this section we describe some of the
more interesting aspects of the semantics of OQL^S, using simple self-explanatory
queries to illustrate the points. In Section 5.1 we describe path expression
“crossovers” from structured to semistructured data and vice-versa. In Section 5.2 we
describe how our approach enables structured ODMG views over semistructured
data. Section 5.3 discusses the semantics of OQL^S constructs such as arithmetic
and logical expressions involving hybrid operands.

5.1 Path expression crossovers 

When we evaluate a path expression in an OQL^S query (recall Section 2.2),
a corresponding database path may involve all structured data, all
semistructured data, or there may be crossover points that navigate from structured
to semistructured data or vice-versa. Crossing boundaries from structured to
semistructured data is fairly straightforward, since we can always identify the
crossover points statically. For example, the following query selects the names
of all competing products and companies for all products in the broker catalog
from Section 2.3:

Select N
From Broker.products P, P.prodinfo.competing C, C.name N

P is statically known to be of type Product, but prodinfo is an OEM attribute, 
and C is therefore of type OEM; prodinfo is thus a crossover point from structured 
to semistructured data. 

Semistructured to structured crossover is more complicated, and we focus on 
this case. It is not possible to identify such crossover points statically without 
detailed knowledge of the structure of a hybrid database. To define the semantics 
of queries with transparent crossover from semistructured to structured data, we 
introduce (below) the logical concept of OEM proxy for encapsulating structured 
data in OEM objects. We also discuss in Section 5.1 an explicit form of crossover 
for users who do have detailed knowledge of structure. 

In the purely semistructured OEM data model, atomic OEM objects are leaf
nodes in the database graph. The result of attempting to navigate an edge from
an atomic OEM object is defined in Lorel to be the empty set, i.e., the result of
evaluating X.label is empty if X is not a complex OEM object. However, in our
extended ODMG model, OEMatomic objects are containers for values with any
ODMG type (recall Section 3), so in addition to containing atomic values, they
may provide semistructured crossover to structured data.

Thus, OQL^S extends Lorel path expressions with the ability to navigate
structured data encapsulated by OEMatomic objects, in addition to navigating
semistructured data represented by OEMcomplex objects. In our running
example, some of the objects in the Reviews graph labeled subject are OEMatomic
objects of type OEM(Product) and OEM(Company). Thus, the following query
has a semistructured to structured crossover since some of the bindings for C are
OEM(Company) and OEM(Product) objects:

Select A
From Reviews.ConsumersInc R, R.subject C, C.address A

For C bindings of type OEM(Company), the evaluation of C.address generates the
OEM proxy (defined below) of the address attribute in the Company class. For
C bindings of type OEM(Product), the evaluation of C.address yields the empty
set.

To allow flexible navigation of structured data from semistructured data, at
the same time retaining access to all properties of the structured data, we define
the logical notion of OEM proxy objects, as follows:

Definition 1. (OEM Proxy) An OEM proxy object is a temporary OEM
object created (perhaps only logically) to encapsulate a value of any ODMG
type. It is an OEMatomic object that serves as a proxy or a surrogate for the
value it encapsulates. □

Semistructured to structured crossover is accomplished by (logically) creating
OEM proxy objects, perhaps recursively, to contain the result of navigating
past an OEMatomic object. It is important to note that this concept of OEM
proxy is a logical rather than a physical concept. It specifies how a query over
hybrid data is to be interpreted, but does not specify anything about the actual
implementation of the query processor. All we require is that the result of a
query should be the same as the result that would be produced if proxy objects
were actually created for every navigation past every OEMatomic object. For
C bindings of type OEM(Company) in our example query, the corresponding
A bindings are OEM proxy objects of type OEM(tuple(street:string, city:string,
zip:integer)) encapsulating the address attribute of the corresponding Company
objects.

The general algorithm for evaluating X.l when X is an OEMatomic object
of type OEM(T) encapsulating the value Y follows. Consider the different cases
for T:

1. T is an atomic ODMG type (i.e., one of integer, real, char, string, boolean,
binary, or nil): For any label l, the result of evaluating X.l is the empty set.

2. T is a tuple type: For all non-collection fields in the tuple whose labels match
l (note that a label with wildcards can match more than one field) a proxy
object is created encapsulating the value of the field. If the label of a
collection-valued field matches l, proxies are created encapsulating each element of the
collection. The result of evaluating X.l is the set of all such proxies.

3. T is a collection: If T is a set or a bag type, X.l returns a set. This set is
empty unless the label l is the specific label item. For this built-in system
label, the value of X.l is a set of OEM proxies encapsulating the elements
in the collection Y. If T is a list or an array type, X.l is evaluated similarly,
except that the ordering of the elements of Y is preserved by returning a list
of proxies instead of a set. Note that navigation past such objects requires
some knowledge of the type T, since the user needs to use label item (or a
wildcard) to navigate below the encapsulated collection.

4. T is a class: Here, Y is an object encapsulating some value; let Z be an
OEM proxy for that value. The result of evaluating X.l is a set including
the OEM proxies obtained by evaluating Z.l (by recursively applying these
rules), the OEM proxies encapsulating the values of any relationships with
names matching l, and the OEM proxies encapsulating the results of invoking
any methods with names matching l. If T is OEM, X is a reference to an
OEM object Y, and the result of X.l is the same as the result of evaluating
Y.l, i.e., automatic dereferencing is performed.
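
The four cases can be traced with a short Python sketch. It is a logical model only and makes several assumptions that are not in the paper: ODMG tuples are modeled as dictionaries, sets and lists as Python sets and lists, class instances as ordinary Python objects whose attributes and relationships live in vars(), zero-argument methods are listed in a hypothetical class attribute named methods, the wildcard % means "any run of characters", and result sets are returned as lists.

```python
import re
from dataclasses import dataclass


@dataclass
class Proxy:
    """OEMatomic proxy: an untyped container for one typed value."""
    value: object


def matches(pattern, label):
    # '%' is treated as the wildcard "any run of characters" (assumption).
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, label) is not None


def wrap(value):
    """Proxies for a field value: one per element if it is a collection."""
    if isinstance(value, (set, frozenset, list)):
        return [Proxy(v) for v in value]
    return [Proxy(value)]


def navigate(x, label):
    """Evaluate X.l for a proxy X encapsulating the value Y."""
    y = x.value
    if isinstance(y, Proxy):  # T is OEM: automatic dereferencing
        return navigate(y, label)
    if y is None or isinstance(y, (bool, int, float, str, bytes)):
        return []  # case 1: atomic type
    if isinstance(y, dict):  # case 2: tuple type
        out = []
        for field_name, value in y.items():
            if matches(label, field_name):
                out.extend(wrap(value))
        return out
    if isinstance(y, (set, frozenset, list)):  # case 3: collection
        return [Proxy(v) for v in y] if matches(label, "item") else []
    # case 4: class instance (attributes, relationships, then methods)
    out = navigate(Proxy(dict(vars(y))), label)
    for name in getattr(type(y), "methods", ()):
        if matches(label, name):
            out.append(Proxy(getattr(y, name)()))
    return out


class Product:
    def __init__(self, name):
        self.name = name


class Company:
    def __init__(self, name, address, produces):
        self.name, self.address, self.produces = name, address, produces


# Hypothetical data for the running example.
acme = Company("Acme", {"street": "s", "city": "c", "zip": 0},
               [Product("camera1"), Product("strobe7")])
C = Proxy(acme)
names = [n.value for p in navigate(C, "produces") for n in navigate(p, "name")]
```

Here names holds the product names reached through the produces relationship, which corresponds to navigating C.produces P, P.name N when C is of type OEM(Company).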

To illustrate these rules, consider the following query, which selects the names
of all products manufactured by all companies reviewed by ConsumersInc:

Select N
From Reviews.ConsumersInc R, R.subject C,
C.produces P, P.name N

Here, for those C bindings that are of type OEM(Company), the corresponding
P bindings are proxies of type OEM(Product), encapsulating the elements of the
relationship produces in the Company class. Finally, the N bindings encapsulate
the name attributes of the Product objects encapsulated by the P bindings, and
the type of these N bindings is OEM(string). Proxies thus allow “transparent
navigation” to structured data from semistructured data.

Queries should be able to access all properties of structured data referenced
by semistructured data, and OQL^S therefore allows queries to invoke methods
with arguments on structured objects encapsulated by OEMatomic objects. The
expression X.m(arg1, arg2, ..., argn) applies the method m() with the specified
list of arguments to the object encapsulated by X. If X is of type OEM(T), the
result of this expression is a set containing the OEMatomic object encapsulating
the return value of the method, provided T is a class that has a method of this
name and with formal parameters whose types match the types of the actual
parameters arg1, arg2, ..., argn. If T is not a class, or if it is a class without
a matching method, or if X is of type OEMcomplex, this expression returns
the empty set. As an example, the following query selects the inventories of
“camera1” for all companies in the catalog and reviewed by ConsumersInc:

Select C.inventory("camera1")
From Reviews.ConsumersInc R, R.subject C

For those C bindings that are not of type OEM(Company), the evaluation of
C.inventory(“camera1”) yields the empty set. For those C bindings that are of type
OEM(Company), the evaluation of C.inventory(“camera1”) returns a singleton set
containing an OEM(integer) object encapsulating the result value.

OQL allows casts on objects, and as a ramification of our approach, OQL^S
allows any object to be cast to the type OEM by creating the appropriate proxy 
for the object, allowing semistructured-style querying of structured data. For 
instance, an object C of type Company can be cast to an OEM proxy object of 
type OEM(Company) by the expression (OEM) C. Once this casting is performed, 
the proxy can be queried without full knowledge of its structure. This casting 
approach is useful when users have approximate knowledge of the structure of 
some (possibly very complex) structured ODMG data but prefer not to study its 
schema in detail. Queries with casts to OEM are also useful when the structure 
of the database changes frequently and users want the same query to run against 
the changing database without errors. 

Semistructured to structured crossover by explicit coercion. OQL^S
provides another mechanism for accessing structured data from semistructured data:
a modified form of casting that extracts the structured value encapsulated by
an OEMatomic object. Although OEM proxies allow all properties of structured
data contained in OEMatomic objects to be accessed without casts, casts enable
static type checking. Furthermore, casting a semistructured operand to its true
structured type may also provide performance advantages, since it may allow
the query processor to exploit the known structure of the data.

In OQL, a standard cast on a structured operand “(T)X” may produce a
runtime error if the X binding is not a subtype of T. However, this approach
is not suitable for casts on OEM objects, since it contradicts our philosophy of
not mandating structural knowledge of semistructured data. Therefore, OQL^S
provides a separate mechanism for performing casts on OEM objects without
type error, through the built-in Coerce function defined as follows:

Definition 2. (The Coerce function) Let O be an OEM object. The value
of Coerce(C, O) is the singleton set set((C) X) if O is an OEMatomic object
encapsulating an object X in class C (or a subclass of C). Otherwise the value
of Coerce(C, O) is the empty set. □

As an example, the following query selects the products of all Company
subjects of reviews by ConsumersInc. Since the type of the C bindings is known to
be Company, the type of the result returned by the query can be determined
statically to be set(list(Product)):

Select P
From Reviews.ConsumersInc R, R.subject S,
Coerce(Company, S) C, C.produces P

In the future we may also consider a more powerful “case” construct that
successively tries to coerce an OEM object into a structured object from a list of
classes, returning the first non-empty result obtained.
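
A minimal Python sketch of Coerce, and of the possible “case” construct, follows. It is illustrative only: the class and function names are assumptions, Python lists stand in for the singleton and empty sets of Definition 2, and coerce_first is one conceivable reading of a construct the paper only mentions as future work.

```python
from dataclasses import dataclass


@dataclass
class OEMAtomic:
    value: object


@dataclass
class OEMComplex:
    children: list  # (label, OEM object) pairs


def coerce(cls, o):
    """Coerce(C, O): [X] if O is atomic and encapsulates an instance of
    cls (or of a subclass), else []. Never raises a type error."""
    if isinstance(o, OEMAtomic) and isinstance(o.value, cls):
        return [o.value]
    return []


def coerce_first(classes, o):
    """Hypothetical 'case' construct: the first non-empty coercion wins."""
    for cls in classes:
        hit = coerce(cls, o)
        if hit:
            return hit
    return []


class Product:
    pass


class Company:
    def __init__(self, produces):
        self.produces = produces


# Hypothetical subjects of reviews: a company, a product, a complex object.
subjects = [OEMAtomic(Company(["p1", "p2"])), OEMAtomic(Product()),
            OEMComplex([])]

# Select P From ... R.subject S, Coerce(Company, S) C, C.produces P
result = [c.produces for s in subjects for c in coerce(Company, s)]
```

Only the first subject survives the coercion, so every C binding is statically known to be a Company.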

5.2 Structured access to semistructured data 

A powerful feature of OQL^S is its support for structured views of
semistructured data. Intuitively, structured data can be synthesized from semistructured
operands when the semistructured data is known to exhibit some regularity, e.g.,
based on an XML DTD or from an analysis of the data. Such structured views
may provide faster access paths for queries (e.g., via standard indexes), and the
structured results can be exported to standard ODMG applications that do not
understand the OEM model, and that may use APIs such as Java or C++
bindings to access the database.

The synthesis of structured data from semistructured data is accomplished
in OQL^S (once again, without the possibility of type error) using the built-in
Construct function defined as follows:

Definition 3. (The Construct function) Let O be an OEM object. The
expression Construct(T, O) returns a value of type set(T) that is either a singleton
set containing a value of type T constructed from the OEM object O, or the
empty set if no such construction is possible. □

The Construct function may be viewed as a rich coercion function from OEM
to a given type. If O is an OEMatomic object, then Construct behaves similarly
to Coerce in Definition 2 above. If O is an OEMcomplex object, then Construct
creates a structured ODMG tuple. Construct(T, O) is defined recursively as
follows:

1. If O is an OEMatomic object of type OEM(T'), and if T' is identical to
or is coercible to a subtype of T, then a singleton set containing the value
encapsulated by O is returned.

2. If T is a class encapsulating a type T', and if Construct(T', O) = {v}, then
Construct(T, O) is a singleton set containing a new T object encapsulating
the value v.

3. If T is a tuple type, then each field labeled l must be constructed:

(a) If l is a collection of values of type T', then for each l-labeled subobject
O' of O, we evaluate Construct(T', O'). The result v_l is a collection of the
non-empty results of this evaluation. If O is an OEMcomplexlist object
and l is an ordered collection, the ordering of the subobjects of O is
preserved in the construction. Otherwise, an arbitrary order is used for
the resulting collection v_l.

(b) If l has a non-collection type T', the construction is successful if there is
exactly one l-child O' of O and if Construct(T', O') = {v_l}.

Finally, Construct(T, O) is a singleton set containing the tuple with value v_l
for each l-field in the tuple.

4. If T is a collection of values of type T', then for each subobject O' of O
with the reserved label item, we evaluate Construct(T', O'). The result is a
collection of the non-empty results of these evaluations. Once again, if T is
an ordered collection, the ordering of the subobjects of O is preserved in the
construction; otherwise an arbitrary order is produced.

5. In all other cases, Construct returns an empty set.
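
The recursive definition can be followed in a compact Python sketch. It is not Ozone's implementation, and the type encoding is an assumption made for the example: atomic types are the Python types int, float, and str; ("tuple", {field: type}), ("set", T), and ("list", T) describe structured types; ("class", PyClass, T') describes a class encapsulating a tuple type T'. Lists stand in for sets, so subobject order is always preserved, and the atomic coercion rules are a simplification.

```python
from dataclasses import dataclass


@dataclass
class Atomic:
    value: object


@dataclass
class Complex:
    children: list  # ordered (label, OEM object) pairs


ATOMIC_TYPES = (int, float, str)
COLLECTIONS = ("set", "list")


def coerce_atomic(t, v):
    """Simplified coercion between numbers and strings (assumption)."""
    if isinstance(v, bool) or not isinstance(v, ATOMIC_TYPES):
        return []
    if isinstance(v, t):
        return [v]
    try:
        return [t(v)]
    except (ValueError, OverflowError):
        return []


def kids(o, label):
    return [c for (l, c) in o.children if l == label]


def construct(t, o):
    """Return [v] on success and [] on failure."""
    kind = t[0] if isinstance(t, tuple) else None
    if isinstance(o, Atomic):  # rule 1
        if t in ATOMIC_TYPES:
            return coerce_atomic(t, o.value)
        if kind == "class" and isinstance(o.value, t[1]):
            return [o.value]
    if kind == "class":  # rule 2 (T' assumed to be a tuple type)
        return [t[1](**v) for v in construct(t[2], o)]
    if not isinstance(o, Complex):
        return []
    if kind == "tuple":  # rule 3
        fields = {}
        for label, ft in t[1].items():
            subs = kids(o, label)
            if isinstance(ft, tuple) and ft[0] in COLLECTIONS:  # 3(a)
                fields[label] = [v for k in subs for v in construct(ft[1], k)]
            else:  # 3(b): exactly one l-child that can be constructed
                got = construct(ft, subs[0]) if len(subs) == 1 else []
                if not got:
                    return []
                fields[label] = got[0]
        return [fields]
    if kind in COLLECTIONS:  # rule 4
        return [[v for k in kids(o, "item") for v in construct(t[1], k)]]
    return []  # rule 5


class Espec:
    def __init__(self, wattage, weight):
        self.wattage, self.weight = wattage, weight


ESPEC = ("class", Espec, ("tuple", {"wattage": int, "weight": float}))

# Hypothetical specs subobjects: one regular, one with a bad wattage.
good = Complex([("wattage", Atomic("40")), ("weight", Atomic(2.5)),
                ("color", Atomic("red"))])
bad = Complex([("wattage", Atomic("bright")), ("weight", Atomic(2.5))])
specs = construct(ESPEC, good)
```

Extra subobjects such as color are ignored, while a missing or non-coercible required field makes the whole construction return the empty result.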

As an example, let us suppose (simplistically) that we know that the
manufacturer specifications for electrical products in our broker catalog always include
an integer wattage value and a real weight value. We define a class Espec
encapsulating the type tuple(wattage:integer, weight:real). The query below selects a
structured set of specifications:




Select E
From Broker.products P, P.prodinfo.specs S,
Construct(Espec, S) E

The type of the S bindings is OEM, and the type of the E bindings is Espec. 
The result of the query is therefore of type set(Espec). Thus, the result is a set 
of structured objects that may be materialized for indexing purposes, and may 
easily be exported to a Java or C++ application. While this example is very 
simple (and in fact a similar effect could have been achieved by using a tuple 
constructor in the Select clause), it does illustrate the general principle, which 
is to create structured views over portions of the data that are semistructured 
but have known components. 



5.3 Semantics of mixed expressions 

In OQL, expressions (queries) can be composed to form more complex
expressions as long as the expressions have types that can be composed legally.
OQL provides numerous operators for compositions, such as arithmetic
operators (e.g., +), comparison operators (e.g., <), boolean operators (e.g., AND), set
operators (e.g., UNION), indexing operators (e.g., <list_name>[<position>]), etc.
(See [Cat94] for an exhaustive specification of all OQL compositions.) OQL^S
extends the composition rules of OQL to allow semistructured and structured
expressions to be mixed freely in such compositions. We refer to expressions that
include a semistructured subexpression as mixed expressions. Space limitations
preclude an exhaustive treatment of all possible OQL^S expressions in this
paper, but several important aspects of the interpretation of mixed expressions are
highlighted in the remainder of this section.

Run-time coercion is used in the evaluation of mixed expressions.
Faithful to the Lorel philosophy, the OQL^S query processor evaluates mixed
expressions by attempting to coerce their subexpressions to types that can be
composed legally. As an example, we consider the interpretation of compositions
involving the comparison operator <.

In OQL, the expression (X < Y) is legal provided X and Y both have
types integer or real (interpreted as arithmetic comparison), string (interpreted
as string comparison), boolean (interpreted as boolean comparison), or set(T) or
bag(T) (interpreted as set inclusion). A type error is raised in OQL for all other
cases. OQL^S additionally allows X, Y, or both to be of type OEM, and the type
of the mixed boolean expression (X < Y) in that case is also OEM: OEM(true),
OEM(false), or OEMNil (recall Section 3 for definitions of these types). For
instance, the values of (OEM(4) < 5) and (OEM(4) < “5”) are both OEM(true).
The value of (OEM(4) < set(1, 3)) is OEMNil since the two operands cannot be
coerced into comparable types.
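
The three examples can be reproduced with a small Python sketch. It is an illustration under assumptions, not the OQL^S evaluator: the OEM container class is hypothetical, only numeric and string comparisons are modeled, and comparing two strings lexicographically (rather than coercing both to numbers) is a choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class OEM:
    value: object


OEMNil = OEM(None)  # the single nil object


def unwrap(x):
    return x.value if isinstance(x, OEM) else x


def as_number(v):
    if isinstance(v, bool):
        return None
    if isinstance(v, (int, float)):
        return v
    if isinstance(v, str):
        try:
            return float(v)
        except ValueError:
            return None
    return None


def less_than(x, y):
    """Mixed (X < Y): returns OEM(True), OEM(False), or OEMNil."""
    a, b = unwrap(x), unwrap(y)
    if isinstance(a, str) and isinstance(b, str):
        return OEM(a < b)  # string comparison (assumption)
    na, nb = as_number(a), as_number(b)
    if na is None or nb is None:
        return OEMNil  # operands cannot be coerced to comparable types
    return OEM(na < nb)
```

For example, less_than(OEM(4), 5) and less_than(OEM(4), "5") both yield OEM(True), while less_than(OEM(4), {1, 3}) yields OEMNil.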

OEMNil is used to implement three-valued logic for mixed boolean
expressions. Mixed boolean expressions are evaluated in OQL^S according to
the rules of three-valued logic, similar to Lorel and just as NULL values are
treated in SQL [MS93]. There are two important aspects to the use of OEMNil
for implementing three-valued logic: First, if the Where clause of a query is a
mixed boolean expression, a value of OEMNil for the Where clause is interpreted
as false. Second, if a query returns a collection of OEM objects, any OEMNil
values are filtered out from the result; however, OEMNil values may appear in
OEM components of structured query results.

The latter point is illustrated by the following two queries, which have
identical From clauses but differ in their Select clauses.

Select P.prodnum + 100
From Reviews.ConsumersInc R, R.subject P

Select tuple(prod:P, newpnum:P.prodnum+100)
From Reviews.ConsumersInc R, R.subject P

In both queries, since OEM entry point Reviews is used, variable P is of
type OEM. If for a particular P binding passed to the Select clause of the first
query, the value of P.prodnum + 100 is OEMNil (because P.prodnum cannot be
coerced to an integer or a real, e.g., it has string value “123A”), then the value
OEMNil is discarded from the query result. On the other hand, the second query has
a (structured) Select clause of type tuple(prod:OEM, newpnum:OEM). For this
query, a similar P binding would produce an OEMNil value for the newpnum
component of the tuple. That OEMNil value is retained in the result, since it is
part of a tuple in which the prod component has the non-nil value P.
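
Both aspects of OEMNil handling can be sketched in Python. This is a simplified model with assumed names: None stands in for OEMNil, review subjects are plain dictionaries, and the connectives follow the usual Kleene three-valued tables (as with SQL NULL).

```python
NIL = None  # stands in for OEMNil in this sketch


def and3(a, b):
    if a is False or b is False:
        return False
    return NIL if (a is NIL or b is NIL) else True


def or3(a, b):
    if a is True or b is True:
        return True
    return NIL if (a is NIL or b is NIL) else False


def not3(a):
    return NIL if a is NIL else (not a)


def where(value):
    """A Where clause keeps a binding only if its predicate is true,
    so NIL is interpreted as false."""
    return value is True


def to_number(v):
    if isinstance(v, bool):
        return NIL
    if isinstance(v, (int, float)):
        return v
    if isinstance(v, str):
        for parse in (int, float):
            try:
                return parse(v)
            except ValueError:
                pass
    return NIL


def plus(x, y):
    a, b = to_number(x), to_number(y)
    return NIL if (a is NIL or b is NIL) else a + b


def select_oem(values):
    """A collection of OEM results: NIL values are filtered out."""
    return [v for v in values if v is not NIL]


# Hypothetical review subjects; the second prodnum is not coercible.
subjects = [{"prodnum": 7}, {"prodnum": "123A"}, {"prodnum": "15"}]

# Select P.prodnum + 100 ...
q1 = select_oem(plus(p["prodnum"], 100) for p in subjects)

# Select tuple(prod:P, newpnum:P.prodnum+100) ...
q2 = [{"prod": p, "newpnum": plus(p["prodnum"], 100)} for p in subjects]
```

The first result drops the nil value, whereas the second keeps all three tuples and carries the nil value in the newpnum component of the middle one.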

OEM operands can be used in any expression without type error.
Since an OEM object can be a container for any type, OQL^S allows OEM
operands to be used in any expression without type error. An OEM expression
may therefore be used in a query as an atomic value, a collection, a structure, or
a class. For instance, in OQL, the index expression X[Y] is a legitimate
expression only when X is of type string, list(T), or array(T), and when Y is an integer.
In OQL^S, X[Y] is also a legitimate expression when X, Y, or both are of type
OEM. Consider when X is of type OEM. If Y is an integer or an OEMatomic
object encapsulating an integer (or a string coercible to an integer), and if X
is of type OEM(string), then X[Y] is of type OEM(char) and encapsulates the
Y-th character in the string. If X is of type OEM(list(T)) or OEM(array(T)), then
X[Y] has type OEM(T) and encapsulates the Y-th element in the ordered
collection encapsulated by X. Finally, if X is of type OEMcomplexlist, then X[Y]
returns the Y-th subobject of X. For all other types of X, the value of X[Y] is
OEMNil.
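
The indexing rules can be summarized in a short Python sketch. The class names are assumptions, positions are taken to be zero-based, and returning OEMNil for an out-of-range or non-coercible index is also an assumption, since the text above does not cover those cases.

```python
from dataclasses import dataclass


@dataclass
class OEMAtomic:
    value: object


@dataclass
class OEMComplexList:
    children: list  # ordered (label, OEM object) pairs


OEMNil = OEMAtomic(None)


def as_index(y):
    """Y may be an integer, or an atomic object holding an integer or a
    string coercible to an integer."""
    v = y.value if isinstance(y, OEMAtomic) else y
    if isinstance(v, bool):
        return None
    if isinstance(v, int):
        return v
    if isinstance(v, str):
        try:
            return int(v)
        except ValueError:
            return None
    return None


def index(x, y):
    """X[Y] for an OEM operand X; never raises a type error."""
    i = as_index(y)
    if i is None or i < 0:
        return OEMNil
    if isinstance(x, OEMComplexList):
        return x.children[i][1] if i < len(x.children) else OEMNil
    if isinstance(x, OEMAtomic) and isinstance(x.value, (str, list)):
        return OEMAtomic(x.value[i]) if i < len(x.value) else OEMNil
    return OEMNil


# Hypothetical ordered complex object with two subobjects.
lst = OEMComplexList([("item", OEMAtomic(1)), ("item", OEMAtomic(2))])
```

Indexing a string yields a proxy for a character, indexing a list yields a proxy for the element, and indexing an ordered complex object returns the subobject itself.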

6 Implementation 

The Ozone system is fully implemented on top of the ODMG-compliant O2
object database system. An Ozone database is simply an O2 database whose
schema has been enhanced by a schema manager module that adds predefined
classes for the storage and management of OEM objects. OQL^S queries are
compiled by a preprocessor module into intermediate OQL queries on the Ozone
database, and query results are presented to the user via a postprocessor module.
Ozone also provides a loader module for bulk-loading semistructured data from
files, and the semistructured data in the load files may include references to
structured data. See Figure 4 for a depiction of the overall architecture. In this
section, we first describe how OEM objects are represented in an Ozone database,
and then describe the main aspects of the translation from OQL^S to OQL. Space
limitations preclude a complete description of our implementation.

[Figure 4 diagram not reproduced. Labels: ODMG class declarations, OQL queries, query results; Ozone schema manager, Ozone preprocessor, Ozone postprocessor.]

Fig. 4. Architecture of the Ozone system

6.1 Representation of OEM objects in O2

Ozone defines a class OEM to represent OEM objects in O2. This class is the
base class for OEM objects, and the different kinds of OEM objects introduced
in Section 3 are represented by subclasses of OEM, as follows.

Complex OEM objects. The class OEMcomplex represents complex OEM objects.
This class has two subclasses for representing unordered and ordered complex
OEM objects: OEMcomplexset and OEMcomplexlist. Since complex OEM objects
are collections of (label, value) pairs, the types encapsulated by these classes
are set(tuple(label:string, value:OEM)) and list(tuple(label:string, value:OEM))
respectively.

Atomic OEM objects encapsulating atomic values. Atomic OEM objects
encapsulating atomic values also are represented by subclasses of OEM.
For example, the class OEM_integer (encapsulating the type integer) represents
OEM(integer) objects. The class OEM_Object (encapsulating the class Object)
represents the type OEM(Object). (Recall that Object is a supertype of all
classes in ODMG.) The remaining atomic classes are OEM_real, OEM_boolean,
OEM_char, OEM_string, OEM_binary, and OEM_OEM (encapsulating the class
OEM and representing the type OEM(OEM)). The class OEM itself encapsulates
the type nil, and the extent for this class consists of the single named object
OEMNil. The classes described here and in the previous section are the fixed
classes of Ozone: they are present in every Ozone schema.

Atomic OEM objects encapsulating ODMG objects. Recall from Section
3 that an atomic OEM object can encapsulate a value of any ODMG type. We
distinguish between two cases for non-atomic types: classes (this section), and
non-atomic literal types such as tuples or collections (next section).

When C is a class, atomic OEM objects of type OEM(C) could be represented
as instances of the class OEM_Object (previous section): since Object is the root of
the ODMG class hierarchy, the class OEM_Object can store references to objects
in any class. However, the use of a single class has performance limitations, since
the exact types of encapsulated objects would have to be determined at run-time
through potentially expensive schema lookups. Therefore, for performance reasons,
Ozone defines a proxy class OEM_C for representing atomic OEM objects
of type OEM(C) for each user-defined class C. Operations on an object belonging
to a proxy class need not consult the schema and can exploit the structural
properties of the encapsulated structured object (whose exact type is known from the
proxy class). The Field() method, described later in Section 6.2, is an example of
an operation that can exploit structure through this approach. In our running
example, the Ozone schema manager defines proxy classes OEM_Catalog
(encapsulating the class Catalog), OEM_Product (encapsulating the class Product), and
OEM_Company (encapsulating the class Company). These classes represent the
types OEM(Catalog), OEM(Product), and OEM(Company).



Atomic OEM objects encapsulating non-atomic literals. Atomic OEM
objects encapsulating non-atomic literal values (tuple, set, list, etc.) could be
represented by equivalent complex OEM objects. For instance, an atomic OEM
object encapsulating the value tuple(name: “foo”, oid: 4) could be represented
in Ozone by an equivalent OEMcomplexset object with two children: (“name”,
OEM_string(“foo”)) and (“oid”, OEM_integer(4)).
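
The equivalence between a tuple literal and a complex OEM object can be made concrete with a small Python sketch. It is a model under stated assumptions rather than Ozone's representation: an OEMcomplexset is modelled as a frozenset of (label, value) pairs, each atomic child is modelled as a (class name, value) pair, and the function names atomic and tuple_to_complex_set are hypothetical.

```python
from typing import Any, Dict, FrozenSet, Tuple

# Fixed atomic OEM classes named in the text, keyed by the Python type that
# stands in for the encapsulated ODMG type (this mapping is an assumption).
ATOMIC_CLASSES = {
    bool: "OEM_boolean",
    int: "OEM_integer",
    float: "OEM_real",
    str: "OEM_string",
}


def atomic(value: Any) -> Tuple[str, Any]:
    """Tag an atomic value with the fixed OEM class that would hold it.
    Raises KeyError for values outside the small set modelled here."""
    return (ATOMIC_CLASSES[type(value)], value)


def tuple_to_complex_set(fields: Dict[str, Any]) -> FrozenSet[Tuple[str, Tuple[str, Any]]]:
    """Model the OEMcomplexset equivalent of a tuple literal:
    one unordered (label, child) pair per tuple field."""
    return frozenset((label, atomic(value)) for label, value in fields.items())
```

Under this model, the tuple with name “foo” and oid 4 becomes the unordered pair set {("name", ("OEM_string", "foo")), ("oid", ("OEM_integer", 4))}.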

For performance reasons once again, Ozone defines additional OEM subclasses
encapsulating the types of non-atomic class properties (attributes, relationships,
and methods), since these are the non-atomic literal types that are
most commonly encountered in queries. For each such <property> of type P in
each class C, Ozone creates a new auxiliary class OEM_C_<property> to represent
atomic OEM objects of type OEM(P). As with proxy classes, a query on
an auxiliary class object is faster than a query over an equivalent OEMcomplex
object representing the same data. Of course, it is not possible to define auxiliary
classes encapsulating all possible non-atomic types, since the space of temporary
literal types that can be synthesized by queries and subqueries is infinite. For
values of such types, Ozone must create equivalent OEM objects.

Referring again to our running example, the produces relationship of Company
has the non-atomic literal type list(Product), and the Ozone schema manager
creates the auxiliary class OEM_Company_produces encapsulating the type
list(Product). This auxiliary class represents the type OEM(list(Product)). The
class Company also has the non-atomic address attribute, and the schema manager
creates the auxiliary class OEM_Company_address. Similarly, the schema
manager creates the classes OEM_Catalog_vendors and OEM_Catalog_products for
the vendors attribute and the products attribute in the Catalog class.
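
The naming scheme for proxy and auxiliary classes can be sketched in Python. This is a hedged illustration, not the Ozone schema manager: the function generated_classes and the dictionary-based schema format are hypothetical, the sketch assumes that a property whose type is itself a class needs no auxiliary class, and several details of EXAMPLE_SCHEMA (the name attributes and the exact type of address) are invented for the example because this section does not give them.

```python
from typing import Dict, List, Tuple

# Atomic ODMG types that already have fixed OEM classes.
ATOMIC_TYPES = {"integer", "real", "boolean", "char", "string", "binary"}

# Partly assumed schema: only produces, address, vendors and products are
# named in the text; the remaining entries are illustrative guesses.
EXAMPLE_SCHEMA: Dict[str, Dict[str, str]] = {
    "Catalog": {"vendors": "set(Company)", "products": "set(Product)"},
    "Company": {
        "name": "string",
        "address": "tuple(street:string, city:string)",
        "produces": "list(Product)",
    },
    "Product": {"name": "string"},
}


def generated_classes(schema: Dict[str, Dict[str, str]]) -> Tuple[List[str], List[str]]:
    """Return (proxy class names, auxiliary class names) for a schema given as
    {class name: {property name: type name}}."""
    proxies = [f"OEM_{cls}" for cls in schema]
    auxiliary = [
        f"OEM_{cls}_{prop}"
        for cls, props in schema.items()
        for prop, type_name in props.items()
        # Atomic properties use fixed classes; class-typed properties are
        # assumed to use the proxy class of that class.
        if type_name not in ATOMIC_TYPES and type_name not in schema
    ]
    return proxies, auxiliary
```

Applied to EXAMPLE_SCHEMA, the sketch yields the three proxy classes and the four auxiliary classes named in the running example.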

The complete set of fixed, proxy, and auxiliary classes added by the Ozone
schema manager to the schema of Section 2.3 is depicted in Figure 5. Note that
this class hierarchy is invisible to the user; the type OEM is the only user-visible
type added by Ozone.

Fig. 5. Classes added by Ozone to our example schema



6.2 Translation from OQL^S to OQL

OQL^S expressions involving OEM operands are implemented by methods defined
in the class OEM. This class defines methods for navigation, user-specified
coercion, the Construct function described in Section 5.2, and performing different
kinds of unary and binary operations (such as arithmetic, boolean, comparison,
and set operations). In the remainder of this section, we first describe the use
of methods for implementing path expressions in OQL^S. Then we describe the
implementation of user-specified coercion and construction. We conclude with a
brief illustration of how methods are used to implement mixed expressions.

Implementation of OQL^S path expressions. Structured data is queried by
OQL^S in exactly the same way as by OQL on a standard O2 database, i.e., an
OQL^S query over structured (pure ODMG) data does not need to be modified
by the Ozone preprocessor. However, navigation of OEM objects is performed
by methods such as the Field() method, which takes a label argument and produces
the set of children with matching labels. Other navigational methods are
discussed briefly at the end of this section. Let us illustrate how the Ozone
preprocessor rewrites OQL^S queries on OEM objects using the Field() method. The
first query below, an OQL^S query that selects all compatible products for all
products in the broker catalog, is translated by the Ozone preprocessor to the
OQL query that follows it:^5

OQL^S query:

    Select C
    From Broker.products P,
         P.prodinfo.compatible C

Intermediate OQL query:

    Select C
    From P In Broker.products,
         C In P.prodinfo.Field("compatible")

^5 We translate all range variables to the “V In ...” style, since O2’s version of OQL
supports only this form.



For complex OEM objects (ordered or unordered) the implementation of the
Field() method is straightforward: it iterates through the encapsulated collection
of (label, value) pairs and retrieves all children whose labels match the label
argument to the method.
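
A minimal Python sketch of this iteration follows. The class OEMComplex and its field method are hypothetical stand-ins, and the result is a list rather than a set so that children need not be hashable; wildcard labels are not handled here.

```python
from typing import Any, Iterable, List, Tuple


class OEMComplex:
    """Hypothetical model of a complex OEM object holding (label, value) pairs."""

    def __init__(self, pairs: Iterable[Tuple[str, Any]]) -> None:
        self.pairs: List[Tuple[str, Any]] = list(pairs)

    def field(self, label: str) -> List[Any]:
        """Return every child whose label equals the argument (possibly none)."""
        return [value for child_label, value in self.pairs if child_label == label]
```

An object with two children labeled “compatible” returns both of them, and a label that matches nothing returns an empty collection, so a range variable over the result simply produces no bindings.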

For atomic OEM classes encapsulating non-atomic values, the definition of
the Field() method is designed to be consistent with the rules for navigating past
atomic OEM objects as defined in Section 5.1. For each proxy or auxiliary OEM
class, Ozone automatically generates its Field() method (at class definition time)
using the O2 metaschema, an API allowing schema lookups and manipulations
within an application. The Field() method matches the label argument with each
attribute, relationship, and method (without arguments) in the class and returns
OEM objects encapsulating any matching property or the return value of any
matching method. Atomic properties are returned as instances of corresponding
atomic OEM classes (for instance, an attribute of type integer would be returned
in an object of type OEM_integer), while non-atomic properties are returned as
instances of proxy or auxiliary classes (for instance, the produces relationship of a
Company object would be returned in an object of type OEM_Company_produces).
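
The behaviour of Field() on a proxy class can be approximated in Python. Ozone generates the method at class definition time from the O2 metaschema; the sketch below instead uses run-time reflection, which is a deliberate simplification. All names (OEMProxy, field, the Company stand-in and its employee_count and inventory methods) are hypothetical, and each result is modelled as a (wrapper class name, value) pair.

```python
import inspect
from typing import Any, List, Tuple

ATOMIC_WRAPPERS = {bool: "OEM_boolean", int: "OEM_integer",
                   float: "OEM_real", str: "OEM_string"}


class Company:
    """Hypothetical stand-in for the structured class Company."""

    def __init__(self, name: str, produces: List[str]) -> None:
        self.name = name            # atomic attribute (assumed)
        self.produces = produces    # non-atomic relationship

    def employee_count(self) -> int:   # assumed zero-argument method
        return 12

    def inventory(self, product_name: str) -> int:  # needs an argument
        return 0


class OEMProxy:
    """Hypothetical proxy object wrapping one structured object."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def field(self, label: str) -> List[Tuple[str, Any]]:
        """Match label against properties and zero-argument methods."""
        if label.startswith("_") or not hasattr(self.target, label):
            return []
        member = getattr(self.target, label)
        if callable(member):
            required = [
                p for p in inspect.signature(member).parameters.values()
                if p.default is inspect.Parameter.empty
                and p.kind not in (inspect.Parameter.VAR_POSITIONAL,
                                   inspect.Parameter.VAR_KEYWORD)
            ]
            if required:
                return []  # methods with arguments are not handled by Field()
            member = member()
        # Atomic results use the fixed classes; anything else is attributed to
        # an auxiliary class named after the owning class and the property.
        wrapper = ATOMIC_WRAPPERS.get(
            type(member), f"OEM_{type(self.target).__name__}_{label}")
        return [(wrapper, member)]
```

With this model, navigating the produces label of a Company proxy yields a value tagged OEM_Company_produces, and navigating a label that names a method with arguments yields nothing.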

As described in Section 5.1, OQL^S allows method invocations on OEM objects
with the following semantics: if the OEM object encapsulates an object in a
class with a matching method, the method is invoked and a singleton set containing
the OEM proxy object encapsulating the method’s return value is generated;
otherwise the empty set is returned. Methods without arguments are handled
in Ozone in the same way as class attributes as described above. Methods with
one or more arguments are handled by the Invoke() method whose signature is:

    set(OEM) Invoke(string methodName, list(string) argstext)

Note that any valid OQL^S expression can be used as an argument to a
method on an OEM object. The argstext parameter lists the actual query texts
for these expressions. In a proxy OEM class, if methodName matches any method
in the encapsulated class, Invoke() uses the Ozone preprocessor to translate each
OQL^S expression in argstext into its intermediate OQL form. If the argument
types match the method’s signature, Invoke() uses the O2 API to invoke the
method on the encapsulated ODMG object (the intermediate OQL expressions
are used as arguments in this call). For example, an OQL^S query invoking the
inventory() method on all subjects of reviews by ConsumersInc is shown below,
together with its intermediate OQL form:

OQL^S query:

    Select C.inventory("camera1")
    From Reviews.ConsumersInc R,
         R.subject C

Intermediate OQL query:

    Select C.Invoke("inventory",
                    list(""camera1""))
    From R In Reviews.Field("ConsumersInc"),
         C In R.Field("subject")

For any C binding that is of type OEM(Company), the Invoke() method applies
the inventory() method with the argument “camera1” to the encapsulated Company
object. Since this argument is an atomic OQL expression, it does not need
preprocessing. For such C bindings, Invoke() returns a set containing a single
OEM_integer object storing the result of applying the method. For all other C
bindings, Invoke() returns the empty set.
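
The singleton-or-empty semantics of Invoke() can be sketched in Python. This is a simplified model and not the Ozone method: arguments are passed as already-evaluated Python values instead of query texts that the preprocessor must translate, only the number of arguments is checked rather than their types, and the result is an unwrapped value in a list instead of an OEM object in a set. The class names and the stock dictionary are hypothetical.

```python
import inspect
from typing import Any, Dict, List


class OEM:
    """Hypothetical base class: objects that are not proxies never match."""

    def invoke(self, method_name: str, args: List[Any]) -> List[Any]:
        return []


class OEMProxy(OEM):
    """Hypothetical proxy wrapping one structured object."""

    def __init__(self, target: Any) -> None:
        self.target = target

    def invoke(self, method_name: str, args: List[Any]) -> List[Any]:
        method = getattr(self.target, method_name, None)
        if not callable(method):
            return []          # no matching method in the encapsulated class
        try:
            inspect.signature(method).bind(*args)
        except TypeError:
            return []          # arguments do not fit the method's signature
        return [method(*args)]  # singleton result


class Company:
    """Hypothetical structured class with a one-argument method."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self.stock = stock

    def inventory(self, product_name: str) -> int:
        return self.stock.get(product_name, 0)
```

A proxy for a Company answers invoke("inventory", ["lens"]) with a one-element result, whereas any other OEM object, or a call with the wrong number of arguments, produces the empty result and therefore no bindings.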



Path expressions with wildcards and regular expression operators.
Recall that path expressions in OQL^S may contain wildcards and regular
expression operators [AQM+97]. Wildcards are supported by the Field() method,
whose label argument may contain wildcards. Regular expression operators are
implemented through standard set operations and three additional navigational
methods in the OEM class:



– Closure(): X.Closure(“foo”) returns the set of all objects reachable from the
  object X by following zero or more edges labeled “foo”.

– UnionClosure(): X.UnionClosure(set(“foo_1”, ..., “foo_n”)) returns the set of all
  objects reachable from the object X by following zero or more edges whose
  labels match any one of “foo_1”, ..., “foo_n”.

– NestedClosure(): X.NestedClosure(query) returns the set of all objects obtained
  by zero or more executions of query. Variable self is used in the query
  to reference the object on which the method is invoked.

As an example, the following translation uses the NestedClosure and Closure
methods as well as a Union operation to implement a nested regular expression.^6

OQL^S query:

    Select D
    From A(.B|(.C)*)* D

Intermediate OQL query:

    Select D
    From D in A.NestedClosure(
        "Select X
         From X in (self.Field("B")
                    Union
                    self.Closure("C"))")
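
The three closure methods can be modelled in Python over a small labeled graph. This is a sketch under assumptions, not the Ozone implementation: the graph is a dictionary from node names to (label, child) lists, the nested query is represented by a Python function instead of query text, and GRAPH and all function names are hypothetical.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Graph = Dict[Hashable, List[Tuple[str, Hashable]]]

# Hypothetical data: node -> outgoing (label, child) edges.
GRAPH: Graph = {
    "A": [("B", "b1"), ("C", "c1"), ("D", "d1")],
    "b1": [("C", "c2")],
    "c1": [("C", "c3")],
    "c3": [("B", "b2")],
}


def field(graph: Graph, node: Hashable, label: str) -> List[Hashable]:
    """Children of node reached by an edge with the given label."""
    return [child for edge_label, child in graph.get(node, []) if edge_label == label]


def nested_closure(start: Hashable,
                   step: Callable[[Hashable], Iterable[Hashable]]) -> Set[Hashable]:
    """Objects obtained by zero or more applications of step, starting at start."""
    seen = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for nxt in step(current):
            if nxt not in seen:      # the seen set also guards against cycles
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def union_closure(graph: Graph, start: Hashable, labels: Iterable[str]) -> Set[Hashable]:
    """Objects reachable over zero or more edges labeled with any of labels."""
    label_list = list(labels)
    return nested_closure(
        start, lambda n: [c for lab in label_list for c in field(graph, n, lab)])


def closure(graph: Graph, start: Hashable, label: str) -> Set[Hashable]:
    """Objects reachable over zero or more edges with one label."""
    return union_closure(graph, start, [label])


def nested_example(graph: Graph, start: Hashable) -> Set[Hashable]:
    """Model of A(.B|(.C)*)*: each step is Field("B") Union Closure("C")."""
    return nested_closure(
        start, lambda obj: set(field(graph, obj, "B")) | closure(graph, obj, "C"))
```

On GRAPH, the nested form and the single call union_closure(GRAPH, "A", ["B", "C"]) reach the same objects, which illustrates why the expression A(.B|(.C)*)* can also be evaluated as A(.B|.C)*.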



Implementation of user-specified coercion and construction. As described
in Section 5.1, OQL^S allows an object X in any class C to be converted into
an OEM proxy object of type OEM(C) through the user-specified cast expression
(OEM) X. This coercion is implemented in Ozone simply by creating an appropriate
object in the proxy class for C. For instance, if X is of type Company,
(OEM) X is translated by the preprocessor to OEM_Company(X), which creates
a proxy for a Company object.

OQL^S also allows an OEM object X to be coerced explicitly to any class
C using the expression Coerce(C, X). This Coerce function of Definition 2 is
implemented as a method Coerce in the class OEM. One difficulty is that the
result of the Coerce function is of type set(C), where C can be any ODMG class
in the schema, i.e., we have introduced polymorphism that is not supported
by ODMG. In our implementation, method Coerce therefore has the fixed return
type set(Object), which is suitable for returning objects of any type. The
preprocessor then inserts an OQL cast to obtain objects of the proper type.
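
The combination of a fixed, untyped return type with a cast inserted by the caller can be illustrated in Python. The sketch is an analogy rather than Ozone code: the names OEM, coerce and coerce_as are hypothetical, a list stands in for set(Object), and the rule that coercion succeeds exactly when the encapsulated value is an instance of the requested class is an assumption, since Definition 2 is not reproduced in this section.

```python
from typing import Any, List, Type, TypeVar, cast

T = TypeVar("T")


class Company:
    """Hypothetical structured class."""


class OEM:
    """Hypothetical OEM object wrapping an arbitrary value."""

    def __init__(self, value: Any = None) -> None:
        self.value = value

    def coerce(self, cls: type) -> List[object]:
        """Fixed return type, mirroring set(Object): a singleton if the
        encapsulated value belongs to cls, otherwise empty (assumed rule)."""
        return [self.value] if isinstance(self.value, cls) else []


def coerce_as(x: OEM, cls: Type[T]) -> List[T]:
    """Plays the role of the cast inserted by the preprocessor: the untyped
    result is given back its precise static type."""
    return cast(List[T], x.coerce(cls))
```

The cast changes only the static type seen by the caller; the objects returned are the same ones that the untyped method produced.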

Finally, as described in Section 5.2, OQL^S allows the construction of a
structured ODMG value of type T from an OEM object O using the expression
Construct(T, O). The present Ozone prototype requires T to be a class type
and implements the Construct function as a method Construct in the OEM class.
For reasons analogous to the Coerce method, the return type of the method is
set(Object). Method Construct uses the O2 metaschema to create an object in
the class T, then attempts to construct the different attributes of the object
using the rules described in Section 5.2. If the construction is not successful, the
empty set is returned. Once again, the preprocessor must insert an OQL cast to
obtain objects in the specified class.

^6 In this particular example the regular expression could be simplified to A(.B|.C)*,
in which case the translation uses only the UnionClosure method, but we translate
the unsimplified expression for illustrative purposes.

Implementation of mixed expressions. Mixed expressions involving OEM
operands (Section 5.3) are implemented through methods. We will illustrate
the comparison operator “<” as an example. In OQL^S, an OEM object can
be compared with values of the following atomic ODMG types: integer, real,
boolean, and string. For these types the OEM class defines the comparison methods
Less_integer(integer value), Less_real(real value), etc. An OEM object also
can be compared with another OEM object (the comparison is interpreted at
run-time based on the exact types of the values encapsulated by the two objects),
and for this purpose the Less_OEM(OEM value) method is provided. The
“<” operator also denotes set containment. The present Ozone prototype defines
set comparison methods only for those unordered collection types that
have corresponding auxiliary classes. In our example schema, we thus define
the method Less_set_Product(set(Product) value) since OEM_Catalog_products is
of type set(Product), and Less_set_Company(set(Company) value) since the class
OEM_Catalog_vendors is of type set(Company). The return type of all of these
comparison methods is OEM, and the return value is always one of the following
atomic OEM values: OEM_boolean(true), OEM_boolean(false), or OEMNil.

As examples of the use of comparison methods, let X be an OEM object.
Ozone translates the expression (X < 5) to X.Less_integer(5) and the expression
(X < “abc”) to X.Less_string(“abc”). If T is of type set(Company), the expression
(X < T) is translated to X.Less_set_Company(T). Thus, the decision of which
comparison method should be invoked is made statically.
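
The static choice of comparison method can be sketched in Python as a function from the static type of the second operand to a method name. This is an illustration only: comparison_method and translate_less are hypothetical names, types are passed as plain strings, and the sketch does not check whether an auxiliary class actually exists for a given set type, which the prototype requires.

```python
ATOMIC_COMPARABLE = {"integer", "real", "boolean", "string"}


def comparison_method(static_type: str) -> str:
    """Pick the Less_* method name from the static type of the operand
    that is compared with an OEM object."""
    if static_type in ATOMIC_COMPARABLE:
        return f"Less_{static_type}"
    if static_type == "OEM":
        return "Less_OEM"  # resolved further at run-time by the method itself
    if static_type.startswith("set(") and static_type.endswith(")"):
        return f"Less_set_{static_type[4:-1]}"
    raise ValueError(f"no comparison method for type {static_type!r}")


def translate_less(left: str, right: str, right_type: str) -> str:
    """Rewrite (left < right) into a method call on the OEM operand left."""
    return f"{left}.{comparison_method(right_type)}({right})"
```

For instance, translate_less("X", "5", "integer") produces the text X.Less_integer(5), and a right operand of type set(Company) selects Less_set_Company.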

7 Conclusions 

We have extended ODMG with the ability to integrate semistructured data with
structured data in a single database, and we have extended the semantics of OQL
to allow queries over such hybrid data. As far as we know, our work is the first
to provide true integration of semistructured and structured data with a unified
query language. We feel that this direction of work is particularly important
as more and more structured data sources incorporate semistructured XML
information, and vice-versa. We have built Ozone, a system that implements our
ODMG and OQL extensions on top of the O2 object-oriented database system.
Our future work will proceed in the following directions:

– Ozone performance: We plan to investigate optimizations that would allow
  the navigation of OEM proxies for structured data to be as fast as
  standard OQL. We also plan to study general optimizations for navigating
  semistructured data [MW99] in the context of Ozone, and explore the use
  of mechanisms provided by O2 (such as clustering and indexing) to improve
  performance. Finally, we would like to detect regularity in semistructured
  data, and determine whether we can exploit such regularity by using ODMG
  classes to represent such data [NAM98].

– Object-Relational Ozone: We plan to define a similar semistructured extension
  to the object-relational data model [SM96], and define semantics for
  the SQL-3 query language in order to query hybrid (object-relational plus
  semistructured) data.

– Applications: We intend to study a suite of applications that can take advantage
  of our hybrid approach, in order to identify any missing functionality
  and performance bottlenecks. We also plan to investigate general design issues
  for hybrid-data applications.